When we develop real functions, we are often confronted with the task of designing a data representation for complicated forms of information. The best strategy to approach this task is apply a well-known scientific technique: ITERATIVE REFINEMENT. A scientist's problem is to represent a part of the real world using mathematics. The result of the effort is called a MODEL. The scientist then tests the model in many ways, in particular by predicting certain properties of events. If the model truly captured the essential elements of the real world, the prediction will be accurate; otherwise, there will be discrepancies between the predictions and the actual outcomes. For example, a physicist may start by representing a jet plane as a point and by predicting its movement in a straight line using Newton's equations. Later, if there is a need to understand the plane's friction, the physicist may add certain aspects of the jet plane's contour to the model. In general, a scientist refines a model and retests its usefulness until it is sufficiently accurate.
A programmer or a computing scientist should proceed like a scientist. Since the representation of data plays a central role in the work of a programmer, the key is to find an accurate data representation of the real-world information. The best way to get there in complicated situations is to develop the representation in an iterative manner, starting with the essential elements and adding more attributes when the current model is fully understood.
In this book, we have encountered iterative refinement in many of our extended exercises. For example, the exercise on moving pictures started with simple circles and rectangles; later on we developed programs for moving entire lists of shapes. Similarly, we first introduced Web pages as a list of words and embedded Web pages; in section 15.3 we refined the representation of embedded Web pages. For all of these exercises, however, the refinement was built into the presentation.
This section illustrates iterative refinement as a principle of program development. The goal is to model a file system. A file system is that part of the computer that remembers programs and data when the computer is turned off. We first discuss files in more detail and then iteratively develop three data representations. The last part of the section suggests some programming exercises for the final model. We will use iterative refinement again in later sections.
When we turn a computer off, it should remember the functions and the data we worked on. Otherwise we have to reenter everything when we turn it on again. Things that a computer is to remember for a long time are put into files. A file is a sequence of small pieces of data. For our purposes, a file resembles a list; we ignore why and how a computer stores a file in a permanent manner.
|
It is more important to us that, on most computer systems, the collection of files is organized in directories.40 Roughly speaking, a directory contains some files and some more directories. The latter are called subdirectories and may contain yet more subdirectories and files, and so on. The entire collection is collectively called a file system or a directory tree.
Figure 44 contains a graphical sketch of a small directory tree.41 The tree's root directory is TS. It contains one file, called read!, and two subdirectories, called Text and Libs. The first subdirectory, Text, contains only three files; the latter, Libs, contains only two subdirectories, each of which contains at least one file. Each box has one of two annotations. A directory is annotated with DIR, and a file is annotated with a number, which signifies the file's size. Altogether TS contains seven files and consists of five (sub)directories.
Exercise 16.1.1. How many times does a file name read! occur in the directory tree TS? What is the total size of all the files in the tree? How deep is the tree (how many levels does it contain)? Solution
Let's develop a data representation for file systems using the method of iterative refinement. The first decision we need to make is what to focus on and what to ignore.
Consider the directory tree in figure 44 and let's imagine how it is created. When a user first creates a directory, it is empty. As time goes by, the user adds files and directories. In general, a user refers to files by names but thinks of directories as containers of other things.
Model 1: Our thought experiment suggests that our first and most primitive model should focus on files as atomic entities, say, a symbol that represents a file's name, and on the directories' nature as containers. More concretely, we should think of a directory as just a list that contains files and directories.
All of this suggests the following two data definitions:
A directory (short: dir) is either
empty
;
(cons f d)
where f
is a file
and d
is a dir
; or
(cons d1 d2)
where d1
and d2
are dir
s.
The first data definition says that files are represented by
their names. The second one captures how a directory is gradually
cons
tructed by adding files and directories.
A closer look at the second data definition shows that the class of directories is the class of Web pages of section 14.3. Hence we can reuse the template for Web-page processing functions to process directory trees. If we were to write a function that consumes a directory (tree) and counts how many files are contained, it would be identical to a function that counts the number of words in a Web tree.
Exercise 16.2.1. Translate the file system in figure 44 into a Scheme representation according to model 1. Solution
Exercise 16.2.2.
Develop the function how-many
, which consumes a dir
and
produces the number of files in the dir
tree.
Solution
Model 2: While the first data definition is familiar to us and easy to use, it obscures the nature of directories. In particular, it hides the fact that a directory is not just a collection of files and directories but has several interesting attributes. To model directories in a more faithful manner, we must introduce a structure that collects all relevant properties of a directory. Here is a minimal structure definition:
(define-struct dir (name content))
It suggests that a directory has a name and a content; other attributes can now be added as needed.
The intention of the new definition is that a directory has two attributes: a name, which is a symbol, and a content, which is a list of files and directories. This, in turn, suggests the following data definitions:
A directory (short: dir
) is a structure:
(make-dir n c)
n
is a symbol
and c
is a list of files and directories.
A list-of-files-and-directories (short: LOFD) is either
empty
;
(cons f d)
where f
is a file and d
is a LOFD
; or
(cons d1 d2)
where d1
is a dir
and d2
is a LOFD
.
Since the data definition for dir
refers to the
definition for LOFD
s, and the definition for LOFD
s refers
back to that of dir
s, the two are mutually recursive definitions
and must be introduced together.
Roughly speaking, the two definitions are related like those of parent
and
list-of-children
in section 15.1. This, in turn, means
that the design recipe for programming from section 15.2
directly applies to dir
s and LOFD
s. More concretely, to design a
function that processes dir
s, we must develop templates for
dir
-processing functions and LOFD
-processing functions
in parallel.
Exercise 16.2.3. Show how to model a directory with two more attributes: a size and a systems attribute. The former measures how much space the directory itself (as opposed to its files and subdirectories) consumes; the latter specifies whether the directory is recognized by the operating system. Solution
Exercise 16.2.4. Translate the file system in figure 44 into a Scheme representation according to model 2. Solution
Exercise 16.2.5.
Develop the function how-many
, which consumes a dir
according to
model 2 and produces the number of files in the dir
tree.
Solution
Model 3: The second data definition refined the first one with the introduction of attributes for directories. Files also have attributes. To model those, we proceed just as above. First, we define a structure for files:
(define-struct file (name size content))
Second, we provide a data definition:
(make-file n s x)
n
is a symbol, s
is a number,
and x
is some Scheme value.
For now, we think of the content
field of a file as set
to empty
. Later, we will discuss how to get access to the data in
a file.
Finally, let's split the content
field of dir
s into two
pieces: one for a list of files and one for a list of subdirectories. The
data definition for a list of files is straightforward and relies on
nothing but the definition for file
s:
empty
, or
(cons s lof)
where s
is a file
and lof
is a list of files.
In contrast, the data definitions for dir
s and its list of subdirectories
still refer to each other and must therefore be introduced together. Of course, we
first need a structure definition for dir
s that has a field for files and
another one for subdirectories:
(define-struct dir (name dirs files))
Here are the data definitions:
A dir is a structure:
(make-dir n ds fs)
n
is a symbol, ds
is a list of directories, and fs
is a list of files.
A list-of-directories is either
empty
or
(cons s lod)
where s
is a dir
and lod
is a list of directories.
This third model (or data representation) of a directory hierarchy captures the nature of a file system as a user typically perceives it. With two structure definitions and four data definitions, it is, however, far more complicated than the first model. But, by starting with a the simple representation of the first model and refining it step by step, we have gained a good understanding of how to work with this complex web of classes. It is now our job to use the design recipe from section 15.2 for developing functions on this set of data definitions. Otherwise, we cannot hope to understand our functions at all.
The goal of the following sequence of exercises is to develop several common utility functions for directory and file systems, using our third and most refined model. Even though these functions process Scheme-based representations of files and directories, they give us a good idea how such real-world programs work.
Exercise 16.3.1.
Translate the file system in figure 44 into a Scheme
representation. Remember to use empty
for the content of the
files.
Solution
To make the exercise more realistic, DrScheme supports the teachpack dir.ss. It introduces the two necessary structure definitions and a function to create representations of directories according to our third model:
;;create-dir : string -> dir
;; to create a representation of the directory thata-path
specifies: ;; 1. Windows:(create-dir "c:\\windows")
;; 2. Mac:(create-dir "My Disk:")
;; 3. Unix:(create-dir "/home/scheme/")
(define (create-dir a-path) ...)
Use the function to create some small and large examples based on the
directories in a real computer. Warning: For large directory trees,
DrScheme may need a lot of time to build a representation. Use
create-dir
on small directory trees first. Do not define
your own dir
structures.
Exercise 16.3.2.
Develop the function how-many
, which consumes a dir
(according to
model 3) and produces the number of files in the dir
tree. Test the
function on the directories created in exercise 16.3.1. Why are we
confident that the function produces correct results?
Solution
Exercise 16.3.3.
Develop the function du-dir
. The function consumes a directory and
computes the total size of all the files in the entire directory tree. This
function approximates a true disk-usage meter in that it assumes that
directories don't require storage.
Refine the function to compute approximate sizes for subdirectories. Let's assume
that storing a file and a directory in a dir
structure costs 1
storage unit.
Solution
Exercise 16.3.4.
Develop the function find?
, which consumes a dir
and a file name
and determines whether or not a file with this name occurs in the directory tree.
Challenge: Develop the function find
. It consumes a
directory d
and a file name f
. If (find? d f)
is true,
it produces a path to the file; otherwise it produces false
. A path is a list
of directory names. The first one is that of the given directory; the last one is
that of the subdirectory whose files
list contains f
. For
example:
(find TS 'part3) ;; expected value: (list 'TS 'Text)
(find TS 'read!) ;; expected value: (list 'TS)
assuming TS
is define
d to be the directory in
figure 44.
Which read! file in figure 44 should find
discover? Generalize the function to return a list of paths if the file name occurs
more than once. Each path should lead to a different occurrence, and there should
be a path for each occurrence.
Solution
40 On some computers, a directory is called a folder.
41 The picture explains why computer scientists call such directories trees.