Catalog as Book, File, and Database

Cataloging History, Part 3

Steven Lubar
May 2, 2017


“Cataloging History” is a four-part series on the history and theory of museum and exhibition catalogs, focusing on the 1853 New York Crystal Palace. Part 1 considers the early history of this genre, tracking its roots to the catalog of the Museum Wormianum and the Louvre, and exploring the variety of uses to which early American museums put published descriptions of their collections and exhibitions. Part 2 looks in detail at the catalogs and guides published by and about the 1853 New York Crystal Palace. This third part considers the catalog as physical and digital object, examining the affordances of each of these forms, what it encouraged and allowed. Part 4 applies the tools of the digital humanities to explore the Crystal Palace catalogs as digital object. What can we do when we turn the catalog into a database?

(This installment is co-authored with Emily Esten.)

From collections of the Museum of the City of New York

The catalog began as a physical thing. It was crafted from the forms submitted by exhibitors and compiled into lists. It was set in type, printed, bound, boxed, and distributed. In the twenty-first century, the physical object was scanned. It became a digital object, a file. It could be used in new ways, put to new purposes. It began a second chapter in its history. Looking closely at process allows us to understand what its creators intended. Looking closely at product allows us to consider the affordances of each of these forms, what it encouraged and allowed.

Creating the Catalog

We know something of how William Richards, the catalog’s editor, and his clerk, Mr. Benners, compiled the catalog from the paratext he provides. Richards drops hints in his “The Arrangement of the Catalog,” a page and a half preface at the front of the book and a slightly longer preface in the revised edition. We can also understand the process by looking at the catalog of the earlier London exhibition. That history, Richards notes, “was so remarkable, that it was told at great length in the Times and other journals.” And for good reason. It was probably the most complicated book production process to that date. 48 tons of type! 338 tons of paper! Indeed, it may have been one of the largest information processing projects to that date. Richards writes:

“The story of the London catalogue, with its toils, its difficulties, its delays, and its demands upon the patience of its compilers, is not inapplicable to the present work…the history of this manual, like that of its great prototype, is not without interest.”

Indeed it is, for it’s the story of turning data into usable information.

Richards created the catalog working from memoranda submitted from exhibitors. An announcement in The New York Times (November 25, 1852, p.1) called these memoranda “A form of Application for Space.” A later announcement described the information they needed to submit:

The London fair used printed forms for this (in four colors!), and the New York fair probably had printed forms, too. None of those forms survive, as far as we know, but we can imagine what they looked like based on one from the 1862 London International Exhibition.

The Photographic Journal, Sept. 16, 1861, p. 259:

The New York Crystal Palace form may have had a check box for the exhibitor’s role (manufacturer, proprietor, or agent) and perhaps for country. There were also, perhaps, other blanks to fill in — perhaps handling instructions, name of local contact — as well as blanks for the clerk and his assistants to complete: fees paid, notifications of delivery and pickup. No doubt there were places for the exhibitors and the clerk to sign their names, taking responsibility and credit for their work. There were probably blanks, to be filled in later, perhaps by another clerk, to indicate the location of the object on the floor of the Exhibition.

These 4,000 or so memoranda, one for each exhibit, would be filed and organized. Perhaps that was the job of Samuel Webber, who was responsible for “Arrangement of Space and Classification.” Perhaps the various experts responsible for each category of goods (machinery, textiles, sculpture) discussed the hard-to-classify items. The memoranda might have been copied, or simply refiled for various uses. Perhaps a copy of the appropriate files went to the manager of each section of the Fair.

Or perhaps Richards simply organized them once, in the order he needed for the catalog, made a longhand copy for delivery to the printer, and distributed copies of the printed book to each of the Fair’s employees who needed to correlate artifacts and information.

No doubt some exhibitors didn’t follow the rules and some of the clerk’s assistants didn’t catch the problem, or added problems of their own. You can almost hear Richards sigh as he adds to the end of his explanations of the catalog this caveat: “Some obscurities exist in consequence of imperfections in the original memoranda.”

Richards goes into much more detail about these imperfections in the “First Revised Edition” sent to the printer on October 1, 1853. “The task of revision,” he writes, “has not been a light one. It has involved both toil and care for several weeks.” The first edition, he writes, was compiled before the Exhibition opened, “from the meagre material afforded by the original lists” submitted to the clerk. “These were sometimes so illegibly written, or so obscured by translation into English, as to embarrass greatly the labors of the Editor.” Even worse: “numerous parties” never actually sent the promised objects for display. And then there were the erroneous classifications of products!

Richards concludes:

“The Editor has sought diligently to make his work perfect, not with the expectation of literal success, but with the consciousness that only the highest possible aim would insure the greatest practicable approximation to perfection.”

He thanks the “various superintendents of departments” for their help and, finally, the printers, for their “patience and zeal…in making repeated revisions of difficult manuscript, and continual additions to the text up to the latest hour.”

Catalog as Book

The Official Catalog of the New York Exhibition of the Industry of All Nations is a physical object. The one we looked at and handled, a “first revised edition,” is in Brown University’s John Hay Library, call number RAtp N42.2. It’s a bound book, a codex, with the original green paper cover tucked into a maroon library cloth binding. It is 7 ½ inches high by 5 ¼ inches wide and just under an inch thick.

The Catalogue was published by George P. Putnam & Co., 10 Park Place. He was responsible for organizing production and sales. The firm is described as “Contractors to the Association, &c.” The Association held the copyright.

It was set in what was called a “modern,” or “Scotch,” font, and printed by John F. Trow, Printer, of 49 Ann Street, New York, just a few blocks from the publisher. But it may in fact have been printed at the Crystal Palace.

Richards, A Day in the New York Crystal Palace, p. 109

An image in Richards’s A Day in the New York Crystal Palace shows a woman printer at work at the Exhibition on the “Illustrated Record,” the longer, more narrative version of the catalog.

It’s not particularly well printed: cheap paper, uneven print, occasional stray or dropped characters, though we noted no typographical errors. Though it’s been rebound, you can see the three holes where the original paper binding would have been attached, stab-stitched.

This particular copy has not only the main catalog (224 pages) but also the “Cabinet of the Mineralogical and Geographic Cabinet” (16 pages, numbered continuously with the main catalog) and the “Catalogue of the Picture Gallery” (22 pages, numbered starting from 1) bound in, with its own fly title. (There are also about 30 blank pages, presumably added by the binder to make his job easier.) It appears that the three parts were originally issued and bound together. There’s no sign of use at the fair — no notes, no dog-eared pages. A library slip shows that the book was checked out in 1963 and 1971. The library has no information on its provenance, but Richard Noble, the library’s Rare Material Cataloguer, suggests that the “buckram library binding looks like one of our early 20th century ones.”

The Catalogue was for sale at the Crystal Palace. The publisher’s agreement with the Crystal Palace Association guaranteed that no books by other publishers would be sold at the Fair. The Catalogue may have been for sale elsewhere, but we’ve found no advertisements. There’s no price on the first edition, but the revised edition cover read “Price 25 Cents.” There’s a clue about sales in a financial statement published in the New York Times. The Association had spent $8,462.70 on printing the catalog, and received $2,832.60 from its sale to December 1, 1853, “in part.” We might guess, then, at 25 cents a copy, that at least 11,000 copies were sold. It’s not rare today: WorldCat finds 218 copies in libraries.
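The sales estimate above comes from dividing the reported receipts by the cover price. A quick check, assuming every copy sold at the revised edition’s 25-cent price (the first edition carried no price, and the receipts were reported only “in part,” so the result is a floor rather than a count):

```python
from decimal import Decimal

# Figures reported in the financial statement cited above.
receipts = Decimal("2832.60")   # dollars from catalog sales to December 1, 1853
cover_price = Decimal("0.25")   # assumption: every copy sold at 25 cents

# Whole copies that the receipts would cover at that price.
copies_sold = int(receipts // cover_price)
print(copies_sold)
```

If some copies sold for less than 25 cents, the number of copies would be higher, which is why the estimate is phrased as “at least.”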

It was at the Crystal Palace where most purchasers would find it useful. Its layout mimicked the layout of the fair, organized first by country and then by class of object. The Official Catalogue was a compact physical object, easy to carry. It was easy to use. In the British section of the fair? Turn to that page. Numbers made it easy to match objects with their description. (Or should have: the New York Times complained of “the want of proper labels on the articles themselves,” and “the imperfect state” of the catalog.) You could use the catalog to get an overview of a section of the fair; the categories were small enough, with a few exceptions, that it would have been easy to skim.

From the Revised Edition, pp. 5–6

The format made it easy to annotate, too. The user could dog-ear a corner to note something of interest, jot notes in the margin, even remove a page to file elsewhere. (Perhaps that explains the missing pages in several of the online copies.)

But the compact physical arrangement of the Catalogue as a codex had its disadvantages, too. “The disadvantages under which the Catalogue has been prepared,” Richards wrote, “render it vain to hope that it is free from great imperfections and errors.” Only a second edition could fix those, and even the second edition had its errors. Richards asked that mistakes be reported, so that they might be fixed. “Exhibitors and others interested, therefore, who may detect omissions, or mistakes of any kind, are particularly requested to offer corrections immediately to Messrs. Putnam & Co., at the Catalogue office, in the Exhibition building.”

We didn’t look at the physical object until after we had seen several versions online. But seeing, handling, examining, even smelling the real thing offered us something new. The feel of the paper told us something of its quality. We could see the holes from the original paper binding. But perhaps most of all, it had presence, aura. It had been at the Crystal Palace, and now it was in our hands.

Catalog as File

In the first decade of the twenty-first century the Official Catalogue became digital. We’ve seen scans of several copies of it, each of them slightly different. Some include the covers; some are missing pages; some include other Exhibition catalogs. They include library markings. The ability to easily see many copies without travel to many libraries is one of the first advantages of the digital incarnation of the catalog. Though mass-produced, each version of the catalog was also unique.

One of them is at the Harvard University Library. It arrived at the library on October 21, 1854, the gift of Rev. Caleb D. Bradlee, a Boston clergyman. Perhaps he had picked it up at the fair and dropped it off at the library when he returned home. At some point it was bound in a red cover and cataloged with call number Econ 5958.53.7. It was scanned by Google and uploaded to the Internet Archive by “tools.bub.” Downloaded, it forms a PDF file of 10,499,032 bytes.

Another is at the Cornell University Library, cataloged T783.D6 N49 (and later, perhaps when it was moved to storage, with arV18829, and the barcode 3 1924 031 227 105). An operator named “gschmidt,” paid by Microsoft, scanned the book in the late afternoon of July 2, 2008, converting the physical object into pixels and then into bits. Other workers manipulated those scans into a file and OCRed it to turn those pixels back into letters and words. On November 16, 2009, “hank_d” uploaded the book to a server for its second life as a digital object.

That’s where we found it, nestled not in a box from the printer nor at a sales booth at the Crystal Palace nor on the library shelf — moments in its life history — but situated in digital proximity to copies scanned from other libraries and promiscuously distributed on servers at Google, the Internet Archive, and HathiTrust. We downloaded it to create a 16,596,367 byte PDF.

Catalog as Database

Richards turned data into a book. Google turned the book into pixels and files. We have taken the file and transformed it once again: we turned it back into data. We have, more precisely, turned information into a database. Like Mr. Richards, we have sought diligently to make this work perfect, and realize that we have fallen short.

A scan of a page of text and below, the OCR’d version. It’s pretty good.

We started with the OCR (Optical Character Recognition) text of the catalog as provided. For many old books, OCR offers only a rough approximation of the printed text: the type, with its many variations for each letter, is hard for a computer to read. The OCRing of the Cornell digital edition was exceptionally clean, and so we used it for this project. Patrick Rashleigh of the Brown Center for Digital Scholarship developed a Python script to convert the text into a csv file. (CSV stands for “comma-separated values,” that is, a text file with each line of the original text turned into a line of the file, with commas separating the values for each field.) The script extracted country, class, item number, and item from the catalog into the columns of a table. Rashleigh and Elli Mylonas worked with OpenRefine to extract person, role, and place-name for the csv. Finally, we identified issues within the database, primarily lines where the script copied multiple entries into one line, and flagged them for further review.
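The conversion step can be sketched roughly as follows. This is not Rashleigh’s script: the heading and entry patterns, the sample text (with invented exhibitors), and the function name `parse_catalog` are all assumptions made for illustration, and the real OCR text is certainly messier.

```python
import csv
import io
import re

# Invented excerpt in an assumed layout; the real catalog text may differ.
SAMPLE = """\
UNITED STATES
CLASS 1.
1 Specimens of iron ore. Example Exhibitor, prop. Albany, New York.
2 Anthracite coal. Example Company, manu. Pottsville, Pennsylvania.
GREAT BRITAIN
CLASS 5.
1 Model of a steam engine. Another Exhibitor, manu. Leeds.
"""

COUNTRY = re.compile(r"^[A-Z][A-Z ]+$")     # assumed: country headings are all caps
CLASS = re.compile(r"^CLASS\s+(\d+)\.?$")   # assumed: "CLASS n." headings
ENTRY = re.compile(r"^(\d+)\s+(.+)$")       # assumed: entries start with a number


def parse_catalog(text):
    """Yield (country, class, item_number, item), carrying headings down."""
    country = class_no = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if m := CLASS.match(line):
            class_no = m.group(1)
        elif COUNTRY.match(line):
            # A new country heading resets the current class.
            country, class_no = line.title(), None
        elif m := ENTRY.match(line):
            yield country, class_no, m.group(1), m.group(2)


rows = list(parse_catalog(SAMPLE))

# Write the rows as CSV text with a header line.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["country", "class", "item_number", "item"])
writer.writerows(rows)
csv_text = out.getvalue()
print(csv_text)
```

A line-by-line parser like this is also where the “multiple entries in one line” problem would arise: if the OCR joins two entries, the second item number ends up buried inside the first item’s text.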

From there, through a combination of OpenRefine and manual data wrangling, we were able to clean major issues within the existing csv. For example, New York City appeared in multiple forms throughout the document, like:

  • “New-York City”
  • “New York City ”
  • “New York City.”
  • “NewYork City”

The human eye can recognize these forms as referring to the same place, but the database represents them as distinct items. This makes it difficult to take advantage of the benefits of a database. OpenRefine allowed us to identify and merge clustered terms like these throughout the document in order to streamline our processes later on down the line.
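The idea behind this kind of merging can be sketched in a few lines. What follows is a simplified key-collision approach in the spirit of OpenRefine’s clustering, not OpenRefine’s actual algorithm; the function names and the sample list are assumptions for illustration.

```python
import re
from collections import Counter, defaultdict


def collision_key(value):
    """Lowercase the value and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def cluster(values):
    """Map each spelling to the most common spelling that shares its key."""
    groups = defaultdict(Counter)
    for value in values:
        groups[collision_key(value)][value] += 1
    canonical = {}
    for counter in groups.values():
        best = counter.most_common(1)[0][0]
        for variant in counter:
            canonical[variant] = best
    return canonical


places = [
    "New-York City", "New York City ", "New York City.", "NewYork City",
    "New York City", "New York City", "Boston",
]
canonical = cluster(places)
cleaned = [canonical[p] for p in places]
print(sorted(set(cleaned)))
```

A key this aggressive could also merge names that really are different, so proposed merges still need a human review, which is the step OpenRefine’s interface provides.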

Not everything can be done easily through data wrangling techniques. We manually separated out the categories of town and country/state, as some lines contained multiple locations.

We had access to the information of the entire catalog, and the database allows us to add in information contained elsewhere. For example, the database includes information regarding the layout of the exhibition by country and class. This information is readily available in the catalog, but as headings, not connected to each entry. So, we added the information for division and court based on the data provided under the heading of each new class-category.
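Attaching heading-level information to each entry amounts to a lookup keyed on the headings. A minimal sketch, assuming the division and court for each country and class have been transcribed into a table; the labels below are placeholders, not values from the catalog, and `add_layout` is a hypothetical helper.

```python
import csv
import io

# Placeholder heading data: (country, class) -> (division, court).
# These labels are invented for illustration, not taken from the catalog.
LAYOUT = {
    ("United States", "1"): ("Division A", "Court 1"),
    ("Great Britain", "5"): ("Division B", "Court 3"),
}

ENTRIES = """\
country,class,item_number,item
United States,1,1,Specimens of iron ore
Great Britain,5,1,Model of a steam engine
France,9,4,Porcelain vase
"""


def add_layout(rows, layout):
    """Copy each row, adding division and court (blank if the heading is unknown)."""
    for row in rows:
        division, court = layout.get((row["country"], row["class"]), ("", ""))
        yield {**row, "division": division, "court": court}


enriched = list(add_layout(csv.DictReader(io.StringIO(ENTRIES)), LAYOUT))
for row in enriched:
    print(row["country"], row["class"], row["division"], row["court"])
```

Leaving unknown headings blank, rather than guessing, keeps the gaps visible for later review.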

In the catalog, “data” is static: you can’t connect it to items outside the book. However, because this catalog is so rich in location data, we could connect it to other existing databases of geographical information, providing the towns and countries with geographic coordinates. In order to geocode these locations, we acquired a MapQuest API key for GPS Visualizer, a batch geocoding service. Using the town, state, and country categories, we then used GPS Visualizer to generate longitude and latitude coordinates for the object entries in the catalog.

(Minor discrepancies occurred for places whose names or national borders have changed since 1853. For example, GPS Visualizer did not recognize Venice, Austria (now Venice, Italy). In subsequent uses of the geocoding site, changes were made to the input data in order to generate accurate results. For those entries that only included a country, generalized coordinates are used. Because we were particularly interested in New York City data, more exact coordinates are available down to an address level for some entries.)
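Preparing the input for a batch geocoder can be sketched as below. The code does not call GPS Visualizer or any other service; it only builds the address lines, and the override table and the function name `geocoder_query` are assumptions for illustration.

```python
# Assumed overrides: 1853 place names mapped to names a modern geocoder knows.
MODERN_NAMES = {
    ("Venice", "Austria"): ("Venice", "Italy"),
}


def geocoder_query(town, state, country):
    """Build one address line for a batch geocoder from catalog location fields."""
    town, country = MODERN_NAMES.get((town, country), (town, country))
    # Skip empty fields so a country-only entry yields just the country.
    parts = [p.strip() for p in (town, state, country) if p and p.strip()]
    return ", ".join(parts)


queries = [
    geocoder_query("Venice", "", "Austria"),
    geocoder_query("Albany", "New York", "United States"),
    geocoder_query("", "", "France"),
]
print("\n".join(queries))
```

An entry with only a country produces a country-level query, which corresponds to the generalized coordinates mentioned above.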

We have provided the Python scripts, the JSON record of transformations in OpenRefine, and the final products for each catalog in a public repository on GitHub. Traditionally used for software, GitHub is a sharing and publishing platform on which developers can save and update revisions of code. In this case, it allows us to preserve as much as possible of the project’s documentation, to allow for study, analysis, and reuse of the data. You can follow our process, tracing the details of each step to see what it takes to transform scraped text into something more workable. What’s special about GitHub is that it keeps a history of every change to the repository. So, if we choose to make changes to the csv in the future, or want to run it through OpenRefine again to edit various facets, GitHub can document those changes and store the information for future collaborators.

After cleaning the scraped data from the digitized catalog, the end result is a searchable, manipulable csv. This preserves the information from the original catalog, connects it with other data and datasets available, and divides it into categories/columns useful for research. Over 4,000 objects from the exhibition are represented in the catalog.

As with any project that involves data scraping and wrangling, there are issues within the csv. The scripts were not comprehensive in their scraping, some data has been placed in wrong columns, and spelling errors are still present throughout the database. But what we have is usable. We can learn new things from it.

Now, on to the important question: now that we have it in this form, what can we do with the data?


A book is an object. It is not the same as the information contained in it. A file is a digital object, and it has many forms, many shapes. The book and its subsequent forms (the pixels, the bits, the OCR’d text, the database) each have their own affordances. That is, they can be used in different ways. They allow us to do different things. They encourage use in different ways. They offer different opportunities for understanding.

William Richards took a stack of memoranda, raw data, and turned it into a book. Microsoft and Cornell and “gschmidt” took the book and turned it into a digital object, an OCRed file. We took that digital object and turned it into another kind of digital object, a database. We used that database to ask new questions about the Crystal Palace. That’s the story we tell in Part 4.


These essays are dedicated to the memory of David Jaffee, whose work at the Bard Graduate Center inspired their writing. They are based on a presentation to the Bard Graduate Center symposium on the New York City Crystal Palace, part of the opening ceremonies of the New York Crystal Palace 1853 exhibition.

Information on the employees of the exhibition is from “Topographical map of the New York Crystal Palace and a guide to the Exhibition”

Details of production of the London Crystal Palace catalog are on pp. 12–15 of the Supplemental Volume

On the arrangements for publication of the Crystal Palace’s catalogs, see Ezra Greenspan, George Palmer Putnam: Representative American Publisher, Pennsylvania State University Press, 2000.

Thanks to Brian Croxall, Patrick Rashleigh, and Elli Mylonas at the Brown University Library Center for Digital Scholarship for their support throughout this project. And thanks to Benjamin Shaykin (@bshaykin) for identifying the font of the catalog, and to Richard Noble, the John Hay Library’s Rare Material Cataloguer, for his observations on the library’s copy of the book.



Steven Lubar

Professor of American Studies at Brown University. Author of Inside the Lost Museum: Curating, Past and Present.