hckr.fyi // thoughts

A Brief History of Markup & XML

by Michael Szul on

Before XML was established as the preferred data format for applications, web applications were a trove of key/value pairs and pipe-delimited strings, while desktop applications ventured into flat files, manifests, and compiled data files. Although both web and desktop applications could and did use databases, there was no standard way of representing that data if you needed to describe it to an external agnostic program.

The history of XML is rooted in the early days of electronic manuscripts. Many of these manuscripts contained special macros used to format the document in a specific way, but the late 1960s gave rise to a movement towards generic coding, which used descriptive elements (e.g., tags) for document formatting. William Tunnicliffe and the Graphic Communication Association’s (GCA) Composition Committee started the move towards generic coding in order to promote a separation of content from formatting. Tunnicliffe presented this idea to the Canadian Government’s Printing Office in 1967. At around the same time, Stanley Rice, a book designer from New York, was proposing editorial tags meant for structure. Later, Norman Scharpf, the director of the GCA, noticed the trend towards generic markup and started a generic coding project inside the GCA’s Composition Committee. This committee created the GenCode(R) concept, which established the idea that different generic codes would be necessary for different types of documents. The GenCode(R) concept eventually evolved into the GenCode Committee, which had a large role in developing the Standard Generalized Markup Language (SGML).

Charles Goldfarb, meanwhile, was a lawyer by trade: a Harvard Law graduate working in Boston. In his spare time, he created route instructions for sports car rallies, and his knack for abstraction led a friend to compare those instructions to computer programs. Not long after Goldfarb joined IBM, he was put in charge of a research project integrating information systems for law offices. From this work, he, along with Ed Mosher and Ray Lorie, invented GML (named for the initials of their last names) in order to enhance text editing with formatting and to allow information retrieval systems to work well with the documents. GML was based on the generic coding concepts of Tunnicliffe and Rice, but instead of the simple tagging that many envisioned, GML had formally defined document types and an explicit structure. IBM implemented much of GML in its publishing systems, and as a result, it gained acceptance throughout the industry. For his part, Goldfarb continued to work with conceptual document structures, eventually modeling GML into SGML after coining the term “mark-up language.” This occurred when Goldfarb was asked to join the committee on information processes of the American National Standards Institute (ANSI) to head up the development of text description language standards, much of which would be based on GML. It was in this committee that GML morphed into SGML and became a standard. The IRS and the United States Department of Defense were among SGML’s early adopters.

In parallel with SGML came the most well-known tag-based language: HTML. HTML was created by Tim Berners-Lee at the end of the 1980s and the start of the 1990s. Berners-Lee, a physics graduate, worked as an engineer in telecommunications before becoming a contractor at the European Organization for Nuclear Research (known as CERN). He not only led the development of the World Wide Web, but also defined HTML and created the concepts of HTTP and URLs. Berners-Lee was the primary author of HTML, with some assistance from a team at CERN, although the basics of hypertext were first proposed by Vannevar Bush. The purpose of hypertext, HTML, and the World Wide Web was to let distributed employees across the globe share and update information for each other to see. Berners-Lee and his team wrote the first web browser, which was originally developed solely for the NeXT platform and, at the time, processed only text files. Eventually, Berners-Lee put the specifications and code for the entire project, including HTML, on the Internet, sparking interest across the Internet community. As more web browsers became available and more online documents were produced, various implementations of HTML emerged, but no set standard had been created. HTML was eventually standardized in the 2.0 specification, and it has steadily progressed over the years, culminating in its current form as HTML5.

HTML, however, evolved to become useful only for presentation layers and not for data description, and it was considered too limited for anything more. XML was devised instead: less complex than SGML, it was meant to be a data storage and description markup language that was easily readable by both machines and human beings.

Jon Bosak, Tim Bray, James Clark, and a few others came up with the idea of an eXtensible Markup Language (XML). Bray (who later worked at Sun Microsystems) was an invited expert at the World Wide Web Consortium (W3C) and co-editor of the XML and XML namespace specifications, but it was Bosak who decided that HTML wasn’t a suitable technology for use in greater information exchanges. Bosak had an appreciation for the power of SGML, and his leadership has consistently been praised by those who worked with him on the specification. In fact, the W3C eventually reserved a formal identifier (xml:Father) in honor of Bosak. It was Clark who introduced the name XML, while also contributing the idea of the self-closing element tag.

Much like SGML, XML itself is not actually a markup language, but a way of defining markup languages. Although almost all XML documents you see today are referred to as “XML” in a general sense, most are written in a specific vocabulary defined using XML. For example, Jabber is an XML specification for messaging protocols: Jabber is the markup, but XML is what defines the markup. Today, however, XML is usually used as a term for the markup as well as the specification. The W3C took up these early efforts on XML, and the resulting standard ran to only a fraction of the pages needed for SGML. The W3C even reshaped HTML into XHTML, an XML-compliant version of the former.
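To make the distinction between XML and a markup defined with XML more concrete, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`. The two vocabularies below (`message` and `recipe`, with their child elements and attributes) are invented for illustration and are not taken from Jabber or any real specification. The point is only that XML supplies the syntax rules, while each vocabulary supplies its own tag names.

```python
import xml.etree.ElementTree as ET

# A hypothetical messaging vocabulary. XML itself does not define
# <message>, <body>, or <urgent/>; the vocabulary's author does.
MESSAGE_DOC = """\
<message from="alice@example.org" to="bob@example.org">
  <body>Lunch at noon?</body>
  <urgent/>
</message>
"""

# A second, unrelated hypothetical vocabulary.
RECIPE_DOC = "<recipe><title>Toast</title><step>Toast the bread.</step></recipe>"

# The same generic parser reads both, because both follow XML's syntax rules.
message = ET.fromstring(MESSAGE_DOC)
recipe = ET.fromstring(RECIPE_DOC)

message_tag = message.tag                         # root element name
recipient = message.get("to")                     # attribute lookup
body_text = message.find("body").text             # child element text
is_urgent = message.find("urgent") is not None    # self-closing element present?
recipe_tags = [child.tag for child in recipe]     # child element names, in order

print(message_tag, recipient, body_text, is_urgent)
print(recipe_tags)
```

Note that `<urgent/>` uses the self-closing form: it is an element with no content, and the parser treats it exactly like `<urgent></urgent>`.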

Although the creation of XML was originally led by technologists at Sun Microsystems in 1996, it was Microsoft, interestingly enough, that played a much larger role in the promotion and acceptance of XML. At the time, Sun Microsystems’ Java programming language was becoming a write-once, run-anywhere solution that paid huge dividends in terms of systems interoperability. Microsoft saw this as a threat to its core programming influence, and decided to push XML as an alternative in such cases. Since XML was controlled by the W3C instead of Sun, it carried an air of openness that accelerated adoption. As a result, while Java at the time needed add-on XML packages (before a newer release integrated XML support more closely), .NET was built with XML in mind. Microsoft has continued to integrate XML deeply into many of its products and services (including Microsoft Office). XML eventually became important not just to Microsoft, but also to Sun, and even to IBM.

Ultimately, XML was created as a means to invent data vocabularies. When transmitting data from one machine to another, the data alone is not sufficient; instead, it is necessary to exchange the meaning of that data, and XML allows for such descriptive measures.
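As a rough illustration of that last point, the sketch below contrasts a pipe-delimited string of the kind mentioned at the start of this post with an XML version of the same record. The record, its field order, and the element names (`order`, `item`, `quantity`, `unitPrice`) are all hypothetical, chosen only to show how the XML form carries the meaning of each value along with the value itself.

```python
import xml.etree.ElementTree as ET

# Pipe-delimited: the receiver must already know what each position means.
PIPE_RECORD = "1042|widget|3|19.99"
fields = PIPE_RECORD.split("|")   # ['1042', 'widget', '3', '19.99']

# XML: each value travels with a name, and extra context (the currency)
# can be attached as an attribute. All names here are hypothetical.
XML_RECORD = """\
<order id="1042">
  <item>widget</item>
  <quantity>3</quantity>
  <unitPrice currency="USD">19.99</unitPrice>
</order>
"""

order = ET.fromstring(XML_RECORD)

# Build a name -> value mapping straight from the document,
# with no out-of-band knowledge of field positions.
described = {child.tag: child.text for child in order}
currency = order.find("unitPrice").get("currency")

print(fields)
print(described, currency)
```

With the pipe-delimited form, nothing in the data says whether `3` is a quantity or `19.99` is a price; that agreement has to live somewhere outside the message. The XML form is more verbose, but a program that has never seen the record can still discover its labels.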