Part of the art of content management for an online learning community
is reducing the number of types of content. For example, consider a
community where the publisher says "I want articles [magnet content],
comments from users on articles, news from the publisher, comments on
news from users, questions from users, and answers to questions." A
naive implementation from these specifications would result in the
creation of six database tables: articles, comments_on_articles,
news, comments_on_news, questions, and answers. From the RDBMS's
perspective, there is nothing overwhelming about six tables. But
consider that every new table defined in the RDBMS implies roughly twenty
Web scripts. Ten of these scripts will constitute a user experience:
view a directory of content in Table A, view one category, view one
item, view the newest items, grab a form to insert an item, confirm
insertion, request an email alert of comments on an item, and so on. Ten of
these scripts will constitute an administrator's experience: view a
directory of content in Table A, view one category, view one item,
view the newest items, approve an item, disapprove an item, delete an
item, confirm deletion of an item, etc. It will be a bit tough to
code these twenty scripts in a general fashion because the SQL statements will differ
in at least the table names used.
Consider further that to offer a complete index of site content, you'll have to write a program that pulls text from at least six tables into a single index.
How different are these six kinds of content, really? We'll look at the tables that we need to define for storing articles, then proceed to the other types of content.
create table articles (
article_id integer primary key,
-- who contributed this and when
creation_user integer not null references users,
creation_date date not null,
-- what language is this in?
-- visit http://www.w3.org/International/O-charset-lang
-- to see the allowable 2-character codes (en is English, ja is Japanese)
language char(2) references language_codes,
-- could be text/html or text/plain or some sort of XML document
mime_type varchar(100) not null,
-- will hold the title in most cases
one_line_summary varchar(200) not null,
-- the entire article; 4 GB limit
body clob
);
Should all articles in the database be shown to all users? Perhaps it
would be nice to have the ability to store an article and hold it for
editorial examination:
create table articles (
article_id integer primary key,
creation_user integer not null references users,
creation_date date not null,
language char(2) references language_codes,
mime_type varchar(100) not null,
one_line_summary varchar(200) not null,
body clob,
editorial_status varchar(30)
check (editorial_status in ('submitted','rejected','approved','expired'))
);
Do you trust all the programmers in your organization to remember to
include a where editorial_status = 'approved' clause in every script
on the site? If not, perhaps it would be better to
rename the table altogether and build a view for use by application
programmers:
create table articles_raw (
article_id integer primary key,
...
editorial_status varchar(30)
check (editorial_status in ('submitted','rejected','approved','expired'))
);
create view articles_approved
as
select *
from articles_raw
where editorial_status = 'approved';
If you change your mind about how to represent approval status, you
won't need to update dozens of Web scripts; you need only change the
definition of the articles_approved view. (See the views
chapter of SQL for Web Nerds at http://philip.greenspun.com/sql/views
for more on this idea of using SQL views as a means of programming
abstraction.)
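To make the benefit concrete, suppose that approval status later moves out of the articles table into a separate approvals table. (This is a hypothetical change, not part of the data model above; the article_approvals table below is invented for the sketch.) Only the view definition would need to be rewritten; every script that queries articles_approved keeps working unchanged:
create table article_approvals (
article_id integer not null references articles_raw,
approved_by integer not null references users,
approval_date date not null
);
create or replace view articles_approved
as
select a.*
from articles_raw a
where exists (select 1
              from article_approvals aa
              where aa.article_id = a.article_id);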
Comments are often the most interesting material on a site. Here's
one from http://philip.greenspun.com/humor/bill-gates:
"I must say, that all of you who do not recognize the absolute genius of Bill Gates are stupid. You say that bill gates stole this operating system. Hmm.. i find this interesting. If he stole it from steve jobs, why hasn't Mr. Jobs relentlessly sued him and such. Because Mr. Jobs has no basis to support this. Macintosh operates NOTHING like Windows 3.1 or Win 95/NT/98. Now for the mac dissing. Mac's are good for 1 thing. Graphics. Thats all. Anything else a mac sucks at. You look in all the elementary schools of america.. You wont see a PC. Youll see a mac. Why? Because Mac's are only used by people with undeveloped brains." -- Allen ([email protected]), August 10, 1998
If a friend of yours came back from this place and showed these
slides, you'd expect to hear something much richer and more
interesting than the Michelin Guide's sentence. A literate culture
operates with the implicit assumption that knowledge is closed, that
Italian tourism can fit into a book. Perhaps the 350 pages of the
Green Guide aren't enough, but some quantity of writers and pages
would suffice to encapsulate everything worth knowing about Italy.
Oral cultures do not share this belief. Knowledge is open-ended. People may hold differing opinions without one person being wrong. There is not necessarily one truth; there may be many truths. Though he didn't grow up in an oral culture, Shakespeare knew this. Watch Troilus and Cressida and its five perspectives on the nature of a woman's love and try to figure out which perspective Shakespeare thinks is correct.
Feminists, chauvinists, warmongers, pacifists, Jew-haters, inclusivists, cautious people, heedless people, misers, doctors, medical malpractice lawyers, atheists, and the pious are all able to quote Shakespeare in support of their beliefs. That's because Shakespeare uses the multiple characters in each of his plays to show his culture's multiple truths.
In the 400 years since Shakespeare we've become much more literate. There is usually one dominant truth. Sometimes this is because we've truly figured something out. It is tough to argue that a physics textbook on Newtonian mechanics should be an open-ended discussion (though a user comment facility might still be very useful in providing clarifying explanations for confusing sections). Yet even in the natural sciences, one can find many examples in which the culture of literacy distorts discourse.
Academic journals of taxonomic botany reveal disagreement on whether Specimen 947 collected from a particular field in Montana is a member of species X or species Y. But the journals imply agreement on the taxonomy, i.e., on how to build a categorization tree for the various species. If you were to eavesdrop on a cocktail party in a university's department of botany, you'd discover that even this agreement is illusory. There is widespread disagreement on what constitutes the correct taxonomy. Hardly anyone believes that the taxonomy used in journals is correct, but botanists have to stick with it for publication because otherwise older journal articles would be rendered incomprehensible. Taxonomic botany based on an oral culture or a computer system capable of showing multiple views would look completely different.
The Internet and computers, used competently and creatively, make it much easier and cheaper to collect and present multiple truths than in the old world of print, telephone, and snail mail. Multiple-truth Web sites are much more interesting than single-truth Web sites and, per unit of effort and money invested, much more effective at educating users.
create table comments_on_articles_raw (
comment_id integer primary key,
-- on what article is this a comment?
refers_to integer not null references articles,
creation_user integer not null references users,
creation_date date not null,
language char(2) references language_codes,
mime_type varchar(100) not null,
one_line_summary varchar(200) not null,
body clob,
editorial_status varchar(30)
check (editorial_status in ('submitted','rejected','approved','expired'))
);
create view comments_on_articles_approved
as
select *
from comments_on_articles_raw
where editorial_status = 'approved';
This table differs from the articles table only in a single column:
refers_to. How about combining the two:
create table content_raw (
content_id integer primary key,
-- if not NULL, this row represents a comment
refers_to integer references content_raw,
-- who contributed this and when
creation_user integer not null references users,
creation_date date not null,
-- what language is this in?
-- visit http://www.w3.org/International/O-charset-lang
-- to see the allowable 2-character codes (en is English, ja is Japanese)
language char(2) references language_codes,
-- could be text/html or text/plain or some sort of XML document
mime_type varchar(100) not null,
one_line_summary varchar(200) not null,
-- the entire article; 4 GB limit
body clob,
editorial_status varchar(30)
check (editorial_status in ('submitted','rejected','approved','expired'))
);
-- if we want to be able to write some scripts without having to think
-- about the fact that different content types are merged
create view articles_approved
as
select *
from content_raw
where refers_to is null
and editorial_status = 'approved';
create view comments_on_articles_approved
as
select *
from content_raw
where refers_to is not null
and editorial_status = 'approved';
-- let's build a single full-text index on both articles and comments
-- using Oracle Intermedia Text (formerly known as "Context")
create index content_ctx on content_raw (body)
indextype is ctxsys.context;
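With the index in place, a single site-wide search can cover articles and comments together. Here is a sketch of such a query using the Oracle Text contains operator (the query string 'dog training' is just an example):
select content_id, one_line_summary, score(1)
from content_raw
where contains(body, 'dog training', 1) > 0
and editorial_status = 'approved'
order by score(1) desc;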
News stories are not timeless; most should stop being featured after
some date, which implies the need for an expiration_time column in
the news data model. Often a company's press release will be tagged
"for release Monday, April 17." If a publisher wants to continue
receiving press releases from this company, it will respect these
dates. This implies the need for a release_time column as well.
Other than these two columns (expiration_time and release_time), it
would seem that a news story needs more or less the same columns as
articles: a place for a one-line summary, a place for the body of the
story, a way to indicate authorship, and a way to indicate approval
within the editorial workflow.
Upon further reflection, however, perhaps these columns could be
useful for all site content. An article on upgrading from Windows
2000 to Windows XP probably should be set to expire in 2006. If a
bunch of authors and editors are working on a major site update,
perhaps it would be nice to synchronize the release of the new content
for Tuesday at midnight. Let's go back to content_raw:
create table content_raw (
content_id integer primary key,
refers_to integer references content_raw,
creation_user integer not null references users,
creation_date date not null,
release_time date, -- NULL means "immediate"
expiration_time date, -- NULL means "never expires"
language char(2) references language_codes,
mime_type varchar(100) not null,
one_line_summary varchar(200) not null,
body clob,
editorial_status varchar(30)
check (editorial_status in ('submitted','rejected','approved','expired'))
);
How do we find news stories amongst all the content rows? What
distinguishes a news story with a scheduled release time and
expiration date from an article on the Windows 2003 operating system
with a scheduled release time and expiration date? We'll need one more
column:
create table content_raw (
content_id integer primary key,
content_type varchar(100) not null,
refers_to integer references content_raw,
creation_user integer not null references users,
creation_date date not null,
release_time date,
expiration_time date,
language char(2) references language_codes,
mime_type varchar(100) not null,
one_line_summary varchar(200) not null,
body clob,
editorial_status varchar(30)
check (editorial_status in ('submitted','rejected','approved','expired'))
);
create view news_current_and_approved
as
select *
from content_raw
where content_type = 'news'
and (release_time is null or sysdate >= release_time)
and (expiration_time is null or sysdate <= expiration_time)
and editorial_status = 'approved';
Notice the explicit checks for NULL in the view definition above.
You'd think that something simpler such as
and sysdate between release_time and expiration_time
would work. The problem here is SQL's three-valued logic. For the
RDBMS to return a row, all of the AND clauses must return true. NULL
is not true. Any expression or calculation including a NULL evaluates
to NULL. Thus
where sysdate >= release_time
will exclude any rows where release_time
is NULL.
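If the explicit NULL tests seem verbose, Oracle's NVL function can substitute a harmless default for a NULL before the comparison. Here is a sketch of an equivalent view definition; either style is fine, as long as a bare comparison against a possibly-NULL column is avoided:
create or replace view news_current_and_approved
as
select *
from content_raw
where content_type = 'news'
and nvl(release_time, sysdate) <= sysdate
and nvl(expiration_time, sysdate) >= sysdate
and editorial_status = 'approved';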
For content that is not a comment, refers_to will be NULL. For
content that is not temporal, the release and expiration times will
be NULL. Otherwise, most of the columns will be filled most of the
time.
What about questions and answers in a discussion forum? If there is
only one forum on the site, we can simply add rows to the
content_raw table with a content_type of "forum_posting" and query
for the questions by checking refers_to is null. On a site with
multiple forums, we'd need to add a parent_id column to indicate under which
topic a given question falls. Within a forum with many archived
posts, we'll also need some way of storing categorization, e.g., "this
is a Darkroom question". See http://www.photo.net/bboard/
for a running example of a multi-forum system in which older postings
are categorized. The "Discussion" chapter of this book treats this
subject in more detail.
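Here is a sketch of how the multi-forum case might look. The forums table and the parent_id column are illustrations of the idea, not part of the data model developed above:
create table forums (
forum_id integer primary key,
forum_name varchar(100) not null
);
-- each posting hangs off exactly one forum
alter table content_raw add (parent_id integer references forums);
-- the questions (top-level postings) in forum 37, newest first
select content_id, one_line_summary, creation_date
from content_raw
where content_type = 'forum_posting'
and parent_id = 37
and refers_to is null
order by creation_date desc;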
One good thing about the file system is that there are a lot of tools for users with different levels of skill to add, update, remove, and rename files. Programmers can use text editors. Designers can use Web design tools and FTP the results. Page authors can use HTML editors such as Microsoft Front Page.
One bad thing about giving many people access to the file system is the potential for chaos. A designer is supposed to upload a template, but ends up removing a script by mistake. Now users can't log into the site anymore. The standard Windows and Unix file systems aren't versioned. It isn't possible to go back and ask "What did this file look like six months ago?" The file system does not by itself support any workflow (see below). You authorize someone to modify a file or not. You can't say "User 37 is authorized to update this article on aquarium filters, but the members shouldn't see that update until it is approved by an editor."
The deepest problem with using the file system as a cornerstone of your content management system is that files are outside of the database. You will need to store a lot of references to content in the database, e.g., "User 960 is the author of Article 231", "Comment 912 is a comment on Article 529", etc. It is very difficult to keep a set of consistent references to things outside the RDBMS. Suppose that your RDBMS tables are referring to file system files by file name. Someone renames a file. The database doesn't know. The database's referential integrity constraint mechanisms cannot be invoked to protect against this circumstance. It is much easier to keep a set of data structures consistent if they are all within the RDBMS.
Static .html files also have the problem of being, well, static. Suppose that you want a standard header and footer on every page. You can cut and paste these into every .html file on the system. But what if you want to change "Copyright 2003" to "Copyright 2006" in the site-wide footer? You may have to update thousands of files. Suppose that you want the header to include a "Login" link if the request comes in with no user authorization cookie and a "Logout" link if the request comes in from a registered user.
Some of the problems with publisher maintenance of static .html files can be solved by periodically writing and running clever Perl scripts. Deeper problems with the user experience remain, however. First and foremost is the fact that with a static .html file every person who views the page thinks that he or she might be the only person ever to have viewed the page. This makes for a very lonely Internet experience and, generally speaking, not a very profitable one for the publisher.
A sustainable online business will typically offer some sort of online community interaction anchored by its content and will offer a consistently personalized user experience. These requirements entail some sort of computer program executing on every page load. So you might as well take this to its logical conclusion and build every URL in your application the same way: script in the file system executes and pulls content from the RDBMS.
You might find that, in making the data model precise with SQL table definitions, questions for the client arise. You realize that your earlier discussions with the client were too vague in some areas. This is a natural consequence of building a SQL data model. Pick up the phone and call your client to get clarifications. Email with several alternative concrete scenarios. Get your client accustomed to fielding questions in a timely manner.
Show the draft data model to your teaching assistant and discuss with other students before proceeding.
The information designer decides what navigational links are available from every document on the page, how to present the available content sections, and what graphic design elements are required.
The graphic designer contributes drawings, logos, and other artwork in service of the information designer's objectives. The graphic designer also produces mock-up templates (static HTML files) in which these artwork elements are used.
The programmer builds production templates and computer programs that reflect the instructions of publisher, information designer, and graphic designer.
Editors approve content and decide when specific pages go live. Editors assign relative prominence among pages within sections.
In keeping with their relative financial compensation, we consider the needs and contributions of authors second to last. Authors stuff fragments of HTML, plain text, photographs, music, and sound, into the database. These authored entities will be viewed by users only through the templates developed by the programmers.
Below is an example workflow that we used to assign to students at MIT:
Your "practice project" will be a content management system to support a
guide to Boston, along the lines of the AOL City Guide at http://home.digitalcity.com/boston/. You
will need to produce a design document and a prototype implementation.
The prototype implementation should be able to support the following
scenario:
Tip: we recommend modeling workflow as a finite-state machine in which a content item can be in only one state at a time and that single state tells you everything that you need to know about the item. In other words, your software can take action without ever needing to go back and look to see what states the article was in previously.
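One way to make the finite-state-machine idea concrete in SQL is a single state column plus a table of legal transitions. The state names below are illustrative assumptions, not a prescription:
create table cms_states (
state varchar(30) primary key
);
create table cms_state_transitions (
from_state varchar(30) not null references cms_states,
to_state varchar(30) not null references cms_states,
primary key (from_state, to_state)
);
insert into cms_states values ('authoring');
insert into cms_states values ('awaiting_approval');
insert into cms_states values ('live');
insert into cms_states values ('expired');
insert into cms_state_transitions values ('authoring','awaiting_approval');
insert into cms_state_transitions values ('awaiting_approval','authoring');
insert into cms_state_transitions values ('awaiting_approval','live');
insert into cms_state_transitions values ('live','expired');
A content item then carries exactly one value from cms_states, and the software need only consult that value and the transitions table to decide what can happen next.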
Version control becomes critical for preventing lost updates when people are working together. Here's how a lost update can happen: two authors pull up the same document in editing forms at about the same time; the first author saves his changes; the second author, still working from the original text, saves her changes a few minutes later and silently wipes out the first author's work.
Programmers and technical writers at large companies are familiar with the problem of lost updates when multiple people are editing the same document. File-system based version control systems were developed to help coordinate multiple contributors. These systems include Marc Rochkind's original Source Code Control System (SCCS; 1972), Walter Tichy's Revision Control System (RCS; early 1980s), and Dick Grune and Brian Berliner's Concurrent Versions System (CVS; 1986). These systems require more training than is practical for casual users. For example, RCS mandates explicit check-out and check-in. While a file is checked out by User A it is locked and nobody but User A can check it back in. Suppose that User A goes out to lunch, but there is some important news that absolutely must be put on the site. What if User A leaves for a two-week vacation and forgets to check a bunch of files back in? These problems can be worked around manually, but it becomes a challenge when the collaborators are on opposite sides of the globe and cannot see "Oh, Schlomo's coat is still on the back of his chair so he's not yet left for the day."
For distributed authorship of Web content by geographically distributed casually connected users, the most practical system turns out to be one in which check-in is allowed at any time by any authorized person. However, all versions of every document are kept in the database so that one can always revert to an earlier version or pull a section out of an earlier version. This implies that your content management system will have an audit trail: a record of past values held by row-column intersections in a database table, who was responsible for any changes in those values, and when the values were changed.
There are two classical ways to implement an audit trail in an RDBMS. The first is to set up separate audit tables, one for each production table. Every time an update is made to a production table, the old row is written out to an audit table, with a time stamp. This can be accomplished transparently via RDBMS triggers, which are described in the "Triggers" chapter of SQL for Web Nerds at http://philip.greenspun.com/sql/triggers and demonstrated in practice in an open-source audit trail package documented at http://philip.greenspun.com/seia/examples-content-management/audit-acs-doc. The second classical approach is to keep current and archived information in the same table. This is more expensive in terms of computing resources required because the information that you want for the live site is interspersed with seldom-retrieved archived information. But it is easier if you want to program in the capability to show the site as it was on a particular day. Your templates won't have to query a different table, they will merely need a different WHERE clause.
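As a minimal sketch of the first approach (the open-source package cited above is more complete), a shadow table plus a row-level trigger might look like this:
-- only a few columns are audited here, to keep the sketch short
create table content_raw_audit (
content_id integer,
one_line_summary varchar(200),
body clob,
audit_entry_time date
);
create or replace trigger content_raw_audit_tr
before update on content_raw
for each row
begin
  insert into content_raw_audit
    (content_id, one_line_summary, body, audit_entry_time)
  values
    (:old.content_id, :old.one_line_summary, :old.body, sysdate);
end;
/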
Michael Stonebraker, a professor at the University of California at Berkeley,
looked at this problem around 1990 and decided to build an RDBMS with,
among other advanced features, native support for versioning. This
became the PostgreSQL open-source RDBMS. The original PostgreSQL had
a "no-overwrite architecture" in which a change to a row resulted in a
complete new version of that row being written out to the disk. Thus
the hard disk drive contained all previous versions of every row in the
table. A programmer could select * from
content_table['epoch','1995-01-01'] ...
to get all versions
from the beginning of time ("epoch") until January 1, 1995. This
innovation made for some nice articles in academic journals, but
execrable transaction processing performance. The modern PostgreSQL
scrapped this idea in favor of Oracle-style write-ahead logging in
which only updates are written to the hard drive (see the "Write-Ahead
Logging" chapter of the PostgreSQL documentation at http://www.postgresql.org/docs/current/static/wal.html).
create table content_raw (
content_id integer primary key,
content_type varchar(100) not null,
refers_to integer references content_raw,
creation_user integer not null references users,
creation_date date not null,
release_time date,
expiration_time date,
-- some of our content is geographically specific
zip_code varchar(5),
-- a lot of our readers will appreciate Spanish versions
language char(2) references language_codes,
mime_type varchar(100) not null,
one_line_summary varchar(200) not null,
-- let's use BLOB in case this is a Microsoft Word doc or JPEG
-- a BLOB can also hold HTML or plain text
body blob,
editorial_status varchar(30)
check (editorial_status in ('submitted','rejected','approved','expired'))
);
If this table were to contain seven versions of an article with a Content ID
of 5657, that would violate the primary key constraint on the
content_id column. What if we remove the primary key
constraint? In Oracle this prevents us from establishing referential
integrity constraints pointing to this ID. With no integrity
constraints, we will be running the risk, for example, that our
database will contain comments on content items that have been
deleted. With multiple rows for each content item, our pointers become
ambiguous. The statement "User 739 has read Article 5657" points from
a specific row in the users table into a set of rows in the
content_raw table. Should we try to be more specific? Do
we want a comment on an article to refer to a specific version of that
article? Do we want to know that a reader has read a specific version
of an article? Do we want to know that an editor has approved a
specific version of an article? It depends. For some purposes, we
probably do want to point to a version, e.g., for approval, and at
other times we want to point to the article in the abstract. If we
add a version_number
column, this becomes relatively
straightforward.
create table content_raw (
-- the combination of these two is the key
content_id integer,
version_number integer,
...
primary key (content_id, version_number)
);
Retrieving information for a specific version is easy. Retrieving
information that is the same across multiple versions of a content
item becomes clumsy and requires a GROUP BY, since we want to collapse
information from several rows into a one-row report:
-- note the use of MAX on VARCHAR column; this works just fine
select content_id, max(zip_code)
from content_raw
where content_id = 5657
group by content_id
We're not really interested in the largest ZIP code for a particular
content item. In fact, unless there has been some kind of
mistake in our application code, we assume that all ZIP codes for
multiple versions of the same content item are the same. However,
GROUP BY is a mechanism for collapsing information from multiple rows.
The SELECT list can contain column names only for those columns that
are being GROUPed BY. Anything else in the SELECT list must be the
result of aggregating the multiple values for columns that aren't
GROUPed. The choices with most RDBMSes are pretty limited: MAX, MIN,
AVG, SUM. There is no "pick any" function. So we use MAX.
Updates are similarly problematic. The U.S. Postal Service periodically redraws the ZIP code maps. Updating one piece of information, e.g., "20016" to "20816", will touch more than one row per content item.
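For example, what is conceptually a single change has to be written as a multi-row update:
-- one logical change, but it touches every version row for item 5657
update content_raw
set zip_code = '20816'
where content_id = 5657;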
This data model is in First Normal Form. Every value is available at
the intersection of a table name, column name, and key (the composite
primary key of content_id and version_number). However, it is not in
Second Normal Form, which is why our queries and updates appear
strange.
In Second Normal Form, all columns are functionally dependent on the whole key. Less formally, a Second Normal Form table is one that is in First Normal Form with a key that determines all non-key column values. Even less formally, a Second Normal Form table contains statements about only one kind of thing.
Our current content_raw table contains some information that depends
on the whole key of content_id and version_number, e.g., the body and
the language code. But much of the information depends only on the
content_id portion of the key: author, creation time, release time,
ZIP code.
When we need to store statements about two different kinds of things, it makes sense to create two different tables, i.e., to use Second Normal Form:
-- stuff about an item that doesn't change from version to version
create table content_raw (
content_id integer primary key,
content_type varchar(100) not null,
refers_to integer references content_raw,
creation_user integer not null references users,
creation_date date not null,
release_time date,
expiration_time date,
mime_type varchar(100) not null,
zip_code varchar(5)
);
-- stuff about a version of an item
create table content_versions (
version_id integer primary key,
content_id integer not null references content_raw,
version_date date not null,
language char(2) references language_codes,
one_line_summary varchar(200) not null,
body blob,
editorial_status varchar(30)
check (editorial_status in ('submitted','rejected','approved','expired')),
-- audit the person who made the last change to editorial status
editor_id integer references users,
editorial_status_date date
);
How does one query into the versions table and find the latest
version? A first try might look something like the following:
select *
from content_versions
where content_id = 5657
and editorial_status = 'approved'
and version_date = (select max(version_date)
from content_versions
where content_id = 5657
and editorial_status = 'approved')
Is this guaranteed to return only one row? No! There is no unique
constraint on (content_id, version_date). In theory, two
editors or authors could submit new versions of an item within
the same second. Remember that the date datatype in Oracle is precise
only to within one second. Even more likely is that an editor doing a
revision might click on an editing form submit button twice with the
mouse or perhaps use the Reload command impatiently. Here's a slight
improvement:
select *
from content_versions
where content_id = 5657
and editorial_status = 'approved'
and version_id = (select max(version_id)
from content_versions
where content_id = 5657
and editorial_status = 'approved')
The version_id column is constrained unique, but we're relying on
unstated knowledge of our application code, i.e., that version_id
will be larger for later versions.
Some RDBMS implementations have extended the SQL language so that you can ask for the first row returned by a query. A brief look at the Oracle manual would lead one to try
select *
from content_versions
where content_id = 5657
and editorial_status = 'approved'
and rownum = 1
order by version_date desc
but a deeper reading of the manual would reveal that the rownum
pseudo-column is set before the ORDER BY clause is processed. An
accepted way to do this in one query is the nested SELECT:
select *
from (select *
from content_versions
where content_id = 5657
and editorial_status = 'approved'
order by version_date desc)
where rownum = 1;
Another common style of programming in SQL that may seem surprising is
taking the following steps:
1. open a cursor for the SQL statement
   select *
   from content_versions
   where content_id = 5657
   and editorial_status = 'approved'
   order by version_date desc
2. fetch one row from the cursor (this will be the one with the
   maximum value in version_date)
3. close the cursor
Another approach is to add a flag column that explicitly marks the
current live version of each item:
create table content_versions (
version_id integer primary key,
content_id integer not null references content_raw,
version_date date not null,
...
editorial_status varchar(30)
check (editorial_status in ('submitted','rejected','approved','expired')),
current_version_p char(1) check(current_version_p in ('t','f')),
...
);
The new current_version_p column can be maintained via a trigger that
runs after insert or update and examines the version_date and
editorial_status columns.
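Here is one possible sketch of such a trigger. The original text does not spell out an implementation; this version is a statement-level trigger that brute-force recomputes the flag for every item, using version_id as the tiebreaker discussed above:
create or replace trigger content_versions_current_tr
after insert or update of version_date, editorial_status on content_versions
begin
  -- recompute the flag for every item; crude but simple
  update content_versions cv
  set current_version_p =
      case when cv.editorial_status = 'approved'
                and cv.version_id = (select max(cv2.version_id)
                                     from content_versions cv2
                                     where cv2.content_id = cv.content_id
                                     and cv2.editorial_status = 'approved')
           then 't'
           else 'f'
      end;
end;
/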
Querying for user pages can be simplified with the following view:
create view live_versions
as
select *
from content_versions
where current_version_p = 't';
Modern commercial RDBMS implementations offer a feature via which rows
in a table can be spread across different tablespaces, each of which
is located on a physically separate disk drive. In Oracle, this is
referred to as partitioning:
create table content_versions (
version_id integer primary key,
content_id integer not null references content_raw,
version_date date not null,
...
editorial_status varchar(30)
check (editorial_status in ('submitted','rejected','approved','expired')),
current_version_p char(1) check(current_version_p in ('t','f')),
...
)
partition by range (current_version_p)
(partition old_crud values less than ('s')
 tablespace slow_extra_disk_tablespace,
 partition live_site values less than (maxvalue)
 tablespace fast_new_disk_tablespace);
All of the rows for the live site will be kept together in relatively
compact blocks. Even if the ratio of old versions to live content is
99:1 it won't affect performance or the amount of RAM consumed for
caching database blocks from the disk. As soon as Oracle sees a
"WHERE CURRENT_VERSION_P =" clause it knows that it can safely ignore
an entire tablespace and won't bother checking any of the irrelevant
blocks.
Have we reached Nirvana? Not according to the database eggheads,
whose relational calculus formulae do not embrace such factors as how
data are spread among physical disk drives. The database theoretician
would note that our data model is in Second Normal Form but not in
Third Normal Form. In a table that is part of a Third Normal Form
data model, all columns are directly dependent on the whole key. The
column current_version_p is not dependent on the table key, but
rather on two other non-key columns (editorial_status and
version_date). SQL programmers refer to this kind of
performance-enhancing storage of derivable data as "denormalization".
If you want to serve ten million requests per day directly from an RDBMS running on a server of modest capacity, you may need to break some rules. However, the most maintainable production data models usually result from beginning with Third Normal Form and adding a handful of modest and judicious denormalizations that are documented and justified.
Note that any data model in Third Normal Form is also in Second Normal Form. A data model in Second Normal Form is in First Normal Form.
Throughout this section, keep in mind that a project with a very clear publishing objective, specs that never change, and one very smart developer does not need version control. A project with evolving objectives, changing specifications, and multiple contributors needs version control.
A shallow objection to this development method in the world of database-backed Internet applications is that it becomes very tedious to make a small change. The programmer checks out the tree onto a development server. The programmer installs an RDBMS, then creates an RDBMS user and a tablespace. The programmer exports the RDBMS from the production site into a dump file, transfers that dump file over the network to the development machine, and imports it into the RDBMS installation on the development server. Keep in mind that for many Internet applications the database may approach one terabyte in size, and therefore it could take hours or days to transfer and import the dump file. Finally, the programmer finds a free IP address or port and sets up an HTTP server rooted at the development tree. Ready to code!
A deeper objection to applying this development method to our world is that it is an obstacle to collaboration. In the Internet application business, developers always work with the publisher and users. Those collaborators need to know, at all times, where to find the latest running version of the software so that they can offer criticism and advice. If there are ten software developers on a service it is not reasonable to ask the publishers and users to check ten separate development sites.
What's wrong with the two-server plan? Nothing if the development and testing teams are the same, in which case there is no possibility of simultaneous development and testing. For a complex site, however, the publisher may wish to spend a week testing before launching a revision. It isn't acceptable to idle authors and developers while a handful of testers bangs away at the development server. The addition of a staging server, rooted at /web/foobar-staging/ (Server 3) allows development to proceed while testers are preparing for the public launch of a new version.
Here's how the three servers are used: developers do their work on the development server, testers exercise release candidates on the staging server, and the general public sees only the production server. Why shouldn't developers work directly against the production database? Because a careless developer might feed a drop table users rather than a drop table users_experimental_extra_table to the database. Or, less dramatically, a junior developer might leave out a WHERE clause in an SQL statement and inadvertently request a result set of 10^9 rows, thus slowing down the production site.
So it would seem that this publisher will need at least one new database. Here are the steps:
After you've got a few articles in, step back and ask the following questions:
Programmers of most applications no longer need concern themselves too much with processor and memory efficiency, which were obsessions in the 1960s. CPU and RAM are available in abundance. But screen real estate is as precious as ever. Look at your page designs. Is the most important information available without scrolling? (In the newspaper business, the term for this is "above the fold".) Are you making the best use of the screen space that you have? Are there large swaths of empty space on the page? Could you be using HTML tables to present two or three columns of information at the same time?
One particularly egregious waste of screen space is the use of icons. Typically, users can't understand what the icons mean so they need to be supplemented with plain language annotation. Generally the best policy is to let the information be the interface, e.g., display a list of article categories (the information) where clicking on a category is the way to navigate to a page showing articles within that category.
Computer and network speeds will change over the years but human beings will evolve much more slowly. Thus we should start by considering limits derived from the humanity of our users. The experimental psychologists will tell us that short-term memory is good for remembering only about seven things at once ("The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information", George A. Miller, The Psychological Review 1956, 63:81-97; http://www.well.com/user/smalin/miller.html) and that this memory is good for only about twenty seconds. It is thus unwise to build any computer application in which users are required to remember too much from one page to another. It is also unwise to build any computer application where the interpage delay is more than twenty seconds. People might forget what task they were trying to accomplish!
IBM Corporation carried out some studies around 1970 and discovered the following required computer response times:
A reasonable goal to strive for in an Internet application is sub-second response time. This goal is based partly on IBM's research, partly on the inability to achieve (in 2005) the 0.1-second mark at which direct manipulation becomes possible, and partly on what is being achieved by the best practitioners. Your users will have used Amazon and Yahoo! and eBay. Any service that is slower than these is going to set off alarm bells in the user's mind: maybe this site is going to fail altogether? Maybe I should try to find a competitive site that does the same job but is faster?
One factor that affects page-loading time is end-to-end bandwidth between your server and the user. You can't do much about this except measure and average. Some Web servers can be configured or reprogrammed to log the total time spent serving a page. By looking at the times spent serving large photographs, for example, you can infer average bandwidth available between your server and the users. If the tenth percentile users are getting 50 Kbits per second, you know that, even if your server were infinitely fast at preparing pages, you should try to make sure that your pages, with graphics, are either no larger than 50 Kbits in size or that the HTML is designed such that the page will render incrementally. (A page that is one big TABLE is bad; a page in which any images have WIDTH and HEIGHT tags is good because the text will be rendered immediately with blank spaces that will be gradually filled in as the images are loaded.)
You can verify your decisions about page layout and graphics heaviness by comparing your pages to those of the most successful Internet service operators such as eBay, Yahoo!, and Amazon.
Remember that in the book and magazine world every page design loads at the same speed, which means that page design is primarily a question of aesthetics. In the Internet world page design and application speed are inextricably linked, which makes page design an engineering problem.
In writing instructions, you can choose between active and passive voice and among first, second, and third person. Instructions are best written in the second-person imperative, leaving out the pronouns, e.g., "Enter departure date" rather than "Enter your departure date".
Oftentimes you can build a system such that error messages are unnecessary. The best user interfaces are those where the user can't make a mistake. For example, suppose that an application needs to prompt for a date. One could do this with a blank text entry box and no hint, expecting the user to type MM/DD/YYYY, e.g., 09/28/1963 for September 28, 1963. If the user's input did not match this pattern or the date did not exist, e.g., 02/30/2002, the application returns a page explaining the requirements. A minor improvement would be to add a note next to the box: "MM/DD/YYYY". If the application logs showed that the number of error pages served was reduced, but not eliminated, perhaps defaulting the text entry box to today's date in MM/DD/YYYY format would be better. Surf over to your favorite travel site, however, and you'll probably find that they've chosen "none of the above". Users are asked to pick a date from a JavaScript calendar widget or pull down month and day from HTML menus.
Sadly, you won't be able to eliminate the need for all error messages. Thus you'll have to make a choice between terse or verbose and between lazy or energetic. A lazy system will respond "syntax error" to any user input that won't work. An energetic system will try to autocorrect the user's input or at least figure out what is likely to be wrong.
Studies have shown that it is worthwhile to develop sophisticated error-handling pages, e.g., ones that correct the user's input and serve a confirmation page. At the very least, it is worth running some regular expressions against the user's offending input to see if its defects fall into a common pattern that can be explained on an error page. It is best to avoid anthropomorphism—the computer shouldn't say "I didn't understand what you typed".
"The natural world is too green and badly lit."
-- François Boucher, 18th-century painter
If you stick to black text on a white background, e.g.,
<body bgcolor=white text=black>, the
browser will treat your users kindly with familiar link colors. By
this sparing use of color in your interface you'll have most of the
color spectrum available for presenting information: charts, graphs,
photos. Compare www.britneyspears.com and http://britneyspears.ac/physics/basics/basics.htm, for example, to see these principles at work.
Be a bit careful with medium gray tones at the very top of Web pages. Many Web browsers use various shades of gray for the backgrounds of menu and button bars at the top of windows. If a Web page has a solid gray area at the top, a user may have trouble distinguishing where the browser software ends and the page content begins. Notice that pages on Yahoo! and Amazon include a bit of extra white space at the top to separate their page content from the browser location and menu bars.
Whatever scheme you choose, keep it consistent site-wide. In 1876 MIT agreed on cardinal and gray for school colors. See how the agreement is holding up by visiting www.mit.edu, clicking on "Administration", and then looking at the subsites for four departments: IS, Medical, Arts, Disabilities Service.
For an excellent discussion of the use of color, see Macintosh Human Interface Guidelines, available online at http://developer.apple.com/documentation/mac/HIGuidelines/HIGuidelines-2.html. Basically the messages are the following: (1) use color sparingly, (2) make sure that a colorblind person can make full use of the application, and (3) avoid red because of its association with alerts and danger.
On interior pages, it is important to answer the following questions:
Where am I? Where have I been? Where can I go?
To answer "Where am I?" relative to the rest of the site, a
Yahoo-style hierarchical navigation bar works well, e.g.,
Home > Arts > Visual Arts > Photography > Panoramic
Note that this bar grows in size as O[log N] where N is the number of
pages on the site. Showing a full site map or top tabs results in
linear growth.
To answer "Where have I been?", start by not instructing the browser to change the standard link colors. The user will thus be cued by the browser for any links that have already been visited. If you're careful with your programming and consistent with your page titles, the user will be able to right-click on the Back button and optionally return to any previous place on your service. Note further that the Yahoo-style navigation bar is effective at answering "Where have I been?" for users who have actually clicked down from the home page.
To answer "Where can I go?" you need ... links! Let the browser default to standard colors so that users will perceive the links as links. It is generally a bad idea to use rollovers, select boxes, or graphics. These controls won't work the same from site to site and therefore users may not understand how to use them. These controls don't have the property that visited links turn a different color; they generally can't or don't tap into the browser's history database. Finally, these controls aren't effective at showing the user where he or she can go because many of the choices are hidden.
Working with your project team members, write a plain-text critique of each project that you review. Look for situations in which the client's requirements, as expressed in the planning exercise solutions, can't be fulfilled with the data model that you see. Look for opportunities to provide constructive criticism. Remember that your classmates don't need a self-esteem boost; they need the benefit of your engineering skills.
Here are some suggested areas where it might be easy to find improvements:
Email your critique to the team members whose work you've just reviewed. Archive these in a file and make them available at http://yourservername/doc/critiques/cm-sent.txt. Watch your own inbox for critiques coming in from the rest of the class. Please assemble these into one file and make them available at http://yourservername/doc/critiques/cm-received.txt
One reason that the information architecture on a typical site is so ill-suited to the user is that the architecture is implicit in scripts and HTML pages. To test an alternative would involve expensive hand-manipulation of computer programs and HTML markup. To offer an individual user or class of user a custom information architecture would be impossible.
What if we represented information architecture explicitly in database tables? These tables would hold the following information:
For the application that you build in this course, it is acceptable to take the expedient path of pounding out scripts with an implicit information architecture. However, we'd like you to be aware of the power for development and testing that can be gained from an explicit information architecture.
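As a rough illustration of the idea (the table and column names below are invented for this sketch, not taken from the text), an explicit information architecture might be stored along these lines:
create table site_sections (
section_id integer primary key,
parent_section integer references site_sections,
section_name varchar(100) not null
);
create table section_content_map (
section_id integer not null references site_sections,
content_id integer not null references content_raw,
-- which class of user sees this placement; 'all' means everyone
user_class varchar(30) default 'all' not null,
sort_key integer,
primary key (section_id, content_id, user_class)
);
With the architecture in tables rather than hard-coded into scripts, trying an alternative organization, or showing a different organization to a different class of user, becomes a matter of updating rows rather than rewriting pages.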
If your client has an existing site, use that as a source of content and minimum requirements. Also look at a couple of sites run by organizations with comparable missions and sizes. For example, if you're building something for an academic group you might look at Harvard University's Department of Molecular and Cellular Biology's Web site at http://www.mcb.harvard.edu/. This site illustrates the basic requirements for a medium-sized organization's Web site. An "overview" section describes the department's purpose and history. A "news" section offers press releases. A "faculty" section explains who works there and what their specialties are. There are also sections for prospective undergraduates and graduate students, i.e., the potential customers for this organization. If you're building something for a small non-profit organization, look at the Web sites for Sustainable Harvest (www.sustainableharvest.org) and the Southern Animal Rescue Association (www.sarasanctuary.org). If you're working for a small manufacturing company, look at www.cirrusdesign.com, the Web site for Cirrus Design Corporation, a Duluth, Minnesota maker of small airplanes.
What if you can't reach your client in time to complete the assignment? Or if you can't get content from your client? Use content from their existing site or a site operated by a similar organization. Make sure that at a minimum there is a lived-in look for a reader who comes to see the "About", "News", and "Contact Us" sections. During the remainder of the course you'll have an opportunity to replace the placeholder content with content from your client.
Note that before embarking on this you may want to read at least the "Separating the Designers and the Programmers" section on templates in the "Software Modularity" chapter.
Record your client's answers to the following questions:
Four minutes isn't very long so you'll need to rehearse and you'll want to make sure that all team members know what they're supposed to do. As a general rule, the person speaking should be addressing the audience, not typing at a computer. Team Member A talks; Team Member B drives. Perhaps at some point in the presentation they switch, but nobody is ever talking and driving a computer at the same time.
Open with an "elevator pitch", i.e., the kind of thirty-second explanation that you'd give to someone you met during an elevator ride. The pitch should explain what problem you're solving and why your system will be better than existing mechanisms available to people.
Create one or more users ahead of time so that you don't have to show your user registration pages. Everyone who has used the Internet has registered at sites. They'll assume that you copied the best practices from amazon.com and other popular sites. If you did, the audience will be bored. If you didn't, the audience will be appalled by your sloppiness. Either way it is best to log in as already-registered users. In fact, sometimes you can arrange to prepare two browsers, e.g., Mozilla and MSIE, one of which is logged in as a new user of the service and one of which is logged in as a site administrator or some other role that you want to demonstrate.
It is best not to refer to "users" during your talk. Instead talk about the roles by name. If, for example, you are building a service around flying, you could say "A student pilot logs in [your teammate logs in], finds an article on flight schools in San Francisco [your teammate navigates to this article], and posts a comment at the bottom about how much he likes his particular instructor." Then perhaps swap positions and your teammate comes up to say "The site editor [you switch browsers to the one logged in as a site admin], clicks on the new content page [you click], sees that there are some new comments pending approval, reads this one from a student pilot, and approves it [you click]." You return the browser to the public page where the comment may now be seen in the live site.
Close by parking the browser at a page that reveals as much of the site's overall structure as possible. Don't despair if you weren't able to show every feature of what you've built. Computer applications are all about the tasks that can be accomplished. If you've made the audience believe that it will be easy to complete a few clearly important tasks, you will have instilled confidence in them.
The skeletal implementation may be challenging, depending on how ambitious the goals of the content management system are, but should take perhaps 10 to 20 programmer-hours of work.
Criticizing other teams' work should take about 15 minutes per project criticized or about two hours total in a class with 8 to 10 projects. This could be done as a group or divided and conquered.
Achieving a lived-in look by pouring in real content shouldn't take more than two hours and ought to be divisible among team members.
Talking to the client will probably take about one hour.