The Thin Grey Line: Using a Combination of Traditional and Current Archival Methods to Archive Blogs

Blogs, akin to online diaries or annotated scrapbooks, are a unique born digital format with established archival merit. The pressing question regarding blogs is not whether they should be archived, but which ones, and how? Combining traditional archival practices with new theories regarding digital records management and preservation is the best way to approach the issue of blog archives. In addition, the graying lines between current and archival material as well as folksonomy and traditional taxonomies will force reevaluation of the roles of record creators, users, and archivists.
 


Introduction
Weblogging, commonly known as “blogging,” is an internet journaling practice incorporating text, images, hyperlinks, and other media. A blog generally includes updated entries in reverse chronological order, often including links to other sites either within the entries or off to the side of the webpage (Blood 2002). In the past few years, the number of blogs online has increased exponentially. As of November 2006, Technorati tracked over 57 million blogs on a variety of subjects ranging from personal diaries to political news. As an alternative medium, blogs can potentially influence politics, society, and culture. Blogs, like other communication records, hold vast potential stores of information.
Librarians and archivists have already identified the pressing need to archive blogs (O’Sullivan 2005, Ovadia 2006). However, while the reasoning is certainly justifiable, the current practice of actually archiving blogs has been almost entirely suggestion and speculation. The practice of blogging is still so new that most blogs have yet to fall out of current use, the condition that would make them eligible for archiving. While it is propitious that the need was identified early, there are still many questions surrounding the practice of how to save these materials for future generations.
Blogs, by their very nature, cannot be archived the same way as fonds of manuscripts. As the number of born digital records like blogs increases, attention must be turned toward new archival methods specific to digital media and the content therein. However, archivists must not lose sight of traditional key components of archival processing, for such practices dictate the reasons for archiving and the methods that best offer access. A combination of traditional and new approaches to archiving must be used in order to adequately ensure the survival of significant material for future generations. In addition, the self-archiving nature of blogs, the transfer of archival responsibility to the user, and the difficulty discerning current materials from non-current will shift theories and practices of archival processing. The lines between these approaches will create a grey area between current collections and archives, physical and digital records, and archivists and users.
 
History
     Blogging grew out of the expansion of internet access and the World Wide Web. Early blogs, circa the mid-1990s, were merely single pages with manually updated content. Considered by Blood (2002) to be the first blog, Mosaic’s “What’s New” webpage featured daily links to new websites. These early pages were updated by hand, with creators directly modifying or adding to the webpage code (Wibbels 2006).
     In the late 1990s, blog software such as Pitas and Blogger made it easy to update and manage content, spurring a wave of blogging (Stone 2003). Users no longer needed special coding knowledge to manipulate their online journals. This new software, with user-friendly interfaces and a “what you see is what you get” approach to the World Wide Web, helped make blogging accessible to the average internet surfer. Numerous additional blogging services sprang forth, such as Xanga, LiveJournal, BlogSpot and even MySpace. Other concurrent developments, like commercial online services such as America Online, and the advent of high-speed internet access helped enable users to access the Web (Fecko 1997) and therefore create a blog. American social and political events in the early years of the 21st century, such as September 11 and the ensuing war with Iraq, inspired an increase in blog traffic, both writing and reading (Stone 2003). Bloggers discussing episodes involving prominent citizens like Trent Lott, Jayson Blair, John Kerry and Dan Rather “brought the blogosphere to the attention of the nation” (Hewitt 2005).
     The popularity of blogs increased through affiliation and citations. By offering lists of blogs they follow (called “blogrolls”), as well as linking to articles or events on which they offered commentary, bloggers created an interactive exchange. This system of links, in conjunction with the ability to include multimedia and reader commentary, transformed blogs from simple journalistic reports to interactive networks (Kline and Burstein 2005). Blogs have also been adopted into corporate environments, serving as communication tools for employees or marketing techniques (Stone 2004).
 
Significance
     The need to archive blog content seems self-evident. Blogs are a significant aspect of life in the current era with a wide variety of influences. Numerous authors have written about the political, economic and social significance of blogs. Several articles have already been written advocating the idea of archiving blogs. O’Sullivan (2005) describes the blog as a modern form of personal diary, with comparable evidentiary value. The annotated links and embedding of audiovisual multimedia also equate blogs to electronic scrapbooks.  Blogs are a potential next-generation primary source document. These “self-referential repositories of memory” will offer evidential value as records of the average person (O’Sullivan 2005).
Ovadia (2006) claims blogs hold insight into mainstream media alternatives. Blogs offering reaction to and commentary on mass media offer a unique perspective of culture that should not go undocumented. Blogs are becoming “communication centers for many kinds of new ideas and new thinking” (Kline and Burstein 2005), and providing alternate points of view (Blood 2002).
     In addition to these primary values, blogs offer a vast array of secondary value to future researchers. Blogs can add more detail to a published story or event by offering further links and references; blogs can inspire further investigation by bringing up questions and issues; blogs can demonstrate insight into media process and evolution of media communication or information dissemination (Ovadia 2006). To take it one step further, blogs can even offer a view of technological evolution (O’Sullivan 2005), and someday, a history of the evolution of blogging. The value of blogs has already been well-established.
     What has yet to be established is how to go about achieving a successful blog archive. Suggestions have been offered, some from the authors calling for blog preservation, some from a broader perspective of internet archiving: after all, a blog is, at its very basic level, a webpage. However, due to their contemporary nature and short history, blogs are still considered current records. Few blogs have fallen into the category of “non-current,” the very definition of the type of records for inclusion in archives (National Archives and Records Administration 1984). The nature of continuous updates and commentary encourages repeat visits to the blog (Blood 2002) and keeps the record in use. Blogs as we know them are still very much current documents; few have been in existence long enough to complete a traditional record life-cycle. It is true that some blogs fall out of use, but the limited time frame cannot offer true perspective: what if the blogger decides several years down the line to continue writing? Blogs focused on topics or events may go dark until a recurrence of noteworthy items to discuss. Determining whether or not a blog is current is difficult at best.
     Since few blogs at this time qualify for archival treatment, the impetus to implement any of these suggestions is lacking. In future years, if and when blogs cease to be current, archivists will be inundated with issues if these are not addressed in advance. Archivists need to think about how to archive these digital materials before they are even made (Zelenyj 1999). By looking ahead and beginning to implement and evaluate archival strategies for blogs, as well as other born digital materials, archivists may be able to create a seamless transition from current blog to sunsetted blog, ensuring unbroken access and original order.
 
Challenges
Archiving blogs is riddled with challenges. As a type of webpage, blogs are subject to the same issues as both electronic records and other web pages. Electronic data is ethereal and ephemeral, and subject to deletion through human error or equipment malfunction. Blogs are “born digital,” that is, they were created digitally, as opposed to a physical object later transferred to an electronic format. Digital records are typically less stable media than physical formats (Hunter 2000), and unlike digitized physical records, there is no original copy to revert to or rely upon if the digital file is corrupted. Blogs are dynamic; adding and modifying content is key to the concept of blogging. Because they contain links, multimedia, and other references, blogs have ambiguous boundaries (Lyman 2002). These inclusions also bring up issues: external links not under the blogger’s control may break, or the site to which they link may disappear. To what depth should links be followed (Masanes 2005)? And to what depth can they be followed without infringing on the intellectual property rights of others?
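The question of link depth can be made concrete with a small sketch. The example below is only an illustration under stated assumptions: the link graph, the page names, and the function pages_within_depth are hypothetical and do not come from any of the cited authors. It shows how a chosen depth limit determines which linked pages would fall inside the boundary of an archived blog.

```python
from collections import deque

# Hypothetical link graph (an assumption for illustration):
# each page maps to the pages it links to.
LINKS = {
    "blog/home": ["blog/post-1", "news.example/story"],
    "blog/post-1": ["blog/home", "other.example/essay"],
    "news.example/story": ["news.example/archive"],
    "other.example/essay": [],
    "news.example/archive": [],
}


def pages_within_depth(start, max_depth, links=LINKS):
    """Return {page: depth} for pages reachable from `start`
    in at most `max_depth` link hops (breadth-first traversal)."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        # Do not follow links out of pages already at the depth limit.
        if depths[page] == max_depth:
            continue
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths
```

With a limit of zero only the starting page is captured; each additional level pulls in more material outside the blogger's control, which is where the intellectual property question arises.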
Approaching blogs via traditional archival methods does not address all the issues posed. Relying strictly on traditional methods and theories of acquisition, appraisal, arrangement and preservation does not adequately ensure preservation and future access to blog materials. New archival approaches, especially methods involving the nature of born digital records, must be considered. However, traditional methods are based in sound archival theory and should not be immediately discarded for all-new, unsubstantiated methods. As user collaboration in archival processing becomes more prominent, education, outreach, and evolution will be necessary on the end of both the user and the archivist. Examining both new and traditional archival approaches to blogs and developing standards combining both aspects can offer the best possibilities for the future.
 
Acquisition
     Traditional acquisition cannot address all the needs of the ethereal and electronic blog record. There is no physical form of a blog that a repository might acquire in the traditional sense. While some electronic records may be stored on media devices like disks or CD-ROMs, blogs are a collection of information bytes stored or “hosted” remotely on a server (Blood 2002). When a blog ceases to be current, it is not boxed up and carted to an archival repository somewhere, like paper records might be. A blog can sit parked in cyberspace despite a lack of updates until it is moved or deleted from the server. This is usually either self-inflicted (the creator removes the work) or due to a malfunction of the server. In both cases, the information is lost forever without pre-established back-up arrangements. Most non-current blogs have simply been abandoned, like a half-completed diary. These blogs remain hosted and can remain accessible long after contributions to it cease. So how can archivists know when to acquire a blog, and how do they go about doing it?
Lyman (2002), discussing web page archiving, states that the initial responsibility to archive rests with the copyright holder. This is comparable to any sort of intellectual property: just as manuscript papers are the responsibility of the creator, so too is the content of a blog. All blogging software offers an archiving function, automatically saving past entries (Blood 2002). Blogger even offers users opportunities to customize their archives page, or the choice to not keep an archive of entries at all, though Stone (2003) considers that option unbearable. Navigating the archive options in blog software may be challenging to the layperson (Doctorow et al. 2002). Inclusion of archival functions within software places the reliance and responsibility on the creators of the blogs rather than the collectors (Koltun 1999). Here is where education and outreach become critical: if the average user is responsible for archiving, he or she needs to know best practices. The archive function of most blog software is sufficient, if the creator knows how to use it. But what of bloggers not using a pre-made software package? Bloggers hosting their own creations, from individuals to corporate blogs, also need to establish archival practices.
Lyman (2002) also notes that archival responsibility can be “subcontracted to others,” just as a manuscript may be handed over to a repository for archiving. The creators of blog content have control over the archives of their creations. Ignorance of what to do with blog records possibly contributes to abandonment and loss. Just as archivists promote and solicit physical collections, they must do the same for digital data like blogs. When offered a collection, archivists should inquire after related electronic documents, especially intangible records.
     A new approach to acquisition for digital records is automation (Masanes 2005). Programmable “spiders” crawl the web, acquiring snapshots of webpages over time. Archives such as the Wayback Machine from the Internet Archive use spiders to build their collections (Alexa Internet 2007). Really Simple Syndication (RSS) can be used to collect and archive changes in web page content. Despite the inherent limitations, O’Sullivan (2005) suggests the Internet Archive model as a means of archiving blogs. However, if blog content and archival responsibility belong to the blog’s creator, acquiring material through spidering (akin to duplication) may violate copyright. Archivists desiring certain collection materials cannot just take them, or grab copies of them. Letters, wills, deeds and other acquisition records evolved to combat issues like these (Peterson 1979). While the preservation aspect of spidering may work, it is critical to involve the creator of the blog, rather than stealing their content from the web.
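As a rough illustration of how RSS might support this kind of collection, the sketch below reads an RSS 2.0 feed and records only the entries not already held. It is a minimal example under stated assumptions, not a description of any existing archive's method: the sample feed, its URLs, and the function harvest_new_entries are hypothetical, and the sketch presumes the creator has agreed to the harvesting.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed (an assumption for illustration); a real
# harvester would retrieve this text from the blog's feed address.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>Second post</title>
      <link>http://blog.example/2</link>
      <guid>http://blog.example/2</guid>
      <pubDate>Tue, 21 Nov 2006 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>First post</title>
      <link>http://blog.example/1</link>
      <guid>http://blog.example/1</guid>
      <pubDate>Mon, 20 Nov 2006 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def harvest_new_entries(feed_xml, archive):
    """Add feed items not yet in `archive` (a dict keyed by guid).

    Returns the list of newly added guids, in feed order.
    """
    root = ET.fromstring(feed_xml)
    new_ids = []
    for item in root.iter("item"):
        # Fall back to the link when an item carries no guid.
        guid = item.findtext("guid") or item.findtext("link")
        if guid is None or guid in archive:
            continue
        archive[guid] = {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        }
        new_ids.append(guid)
    return new_ids
```

Running the function repeatedly against the same feed adds nothing new, so a repository could poll a feed on a schedule and accumulate only the changes.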
 
Appraisal
     Traditional archival appraisal was based on the concept of determining value: what do archivists save, given the premise of limited storage (Hunter 2000)? As digital rather than physical documents, blogs demand no boxes or cubic feet of storage space. Capabilities for electronic information storage are increasing exponentially (Deegan and Tanner 2002) and so the notion of limited storage becomes moot.
     However, even if it is possible to save all blogs, the amount of material in the 57 million existing blogs (Technorati 2006) is overwhelming. O’Sullivan (2005), speaking at the single repository level, states it is not practical to keep all blogs. She suggests appraisal for determination of value, yet she offers no concrete suggestions. One of the originators of traditional archival theory, T. R. Schellenberg (1956), says that in addition to storage constraints, an overwhelming amount of information in records can be daunting to researchers. Schellenberg spoke of physical records, but the concept is applicable to blogs.
This is where an appraisal and collection policy based on traditional archival theory is key. Cox (1994) notes that a faulty appraisal process is better than none at all. Even scholars of digital preservation rest the first step to appraisal on an established policy for retention of digital material (Deegan and Tanner 2002). Every archival institution needs to address digital data in their policies, such as was done at the Wellcome Library (Hilton and Thompson 2007). The same authors also note that because of the ethereal nature of digital material, archivists must act quickly and appraise and process immediately upon acquisition: while physical papers can feasibly sit in an archival storage room for years, storing digital material for the same amount of time may render it inaccessible or obsolete.
 For websites, Masanes (2005) suggests either a site-, topic-, or domain-based approach. A single site can be archived by the user and later transferred to a repository if desired. Or, the repository can take over the existing host, keeping the same web server and location. This transfer would go nearly unnoticed by users, with minimal service interruption, if any. A topical approach could be instigated by a repository with collections focused on certain subjects. The collection policy would include blogs, which archivists would then pursue in the same manner as other physical collections, such as purchasing or soliciting donations. A topical approach might also apply to an online services vendor, such as the service model of Ebsco, Wilson, or other service providers. Just as these vendors offer archives of newspapers and magazines, so too could they offer a database of archived blogs. This potentially benefits all involved: the vendor increases business; the blogger sees his work archived safely; the archivist is freed to work on other projects; and the researcher receives pre-organized electronic access. A domain-centric approach may address blogs and other electronic records from corporations or institutions. For example, many academic libraries offer an archive for the school, and a domain-centered approach would archive all web materials from the institution’s domain, including blogs sponsored by the school. Domain-centered collection could also apply to business or corporate records under a shared domain, especially considering the aforementioned impact blogs have on marketing and business. While based in the traditional archival concept of a collection, these web-specific suggestions are new ways to categorize and organize collections.
 
Arrangement, Description and Organization
Once blogs are acquired and appraised, they must be organized. Like other record materials, blogs are still subject to arrangement, description and cataloging in order to provide access to users seeking these materials. This process will vary depending on the nature of the blog and the collection to which it may belong.
     Blogs are already highly organized at the item level even before they qualify for addition to an archive. Entries are stored in reverse chronological order, so a blog is acquired pre-arranged. To disturb this order would disturb the very nature of the blog and its inherent concept, and also defeat the archival principle of provenance and retaining the original order of materials. The electronic nature of blogs already offers the easy possibility of full-text searching. The recent development of “tagging” (the ability for users to describe subject content in their own terms) is now featured in many blogs and blog software. This is a huge advantage, not only because it offers insight and context, but also because it pre-defines subjects and content, a task once relegated to the archivist. Even if the lack of controlled tag vocabulary forces the archivist to assign authorized subject headings, suggestion points on where to start are already there. This will save archivists time, and as folksonomies advance, subjects will become more accurate because they were defined by a creator with close knowledge. Researchers will be able to access material more easily as they need not lean on the unfamiliar, often outdated language of professional standardized subject headings. While folksonomy alone is unregulated and uncontrolled, archivists and librarians working together with users creating tags could create a balanced vocabulary to serve all groups working with the materials. Pre-arranged blogs also speed processing time, and lower the risk of obsolescence from storage before processing.
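One way to picture creator tags serving as starting points for authorized headings is sketched below. This is only an illustration under stated assumptions: the AUTHORIZED mapping, its headings, and the function reconcile_tags are hypothetical and stand in for whatever controlled vocabulary a repository actually uses. Tags that match are translated into headings; tags that do not are set aside for the archivist to review.

```python
# Hypothetical mapping from normalized creator tags to authorized
# subject headings (an assumption for illustration only).
AUTHORIZED = {
    "blogs": "Blogs",
    "blogging": "Blogs",
    "archives": "Archives",
    "digital preservation": "Digital preservation",
}


def reconcile_tags(tags, authorized=AUTHORIZED):
    """Split creator tags into (matched headings, unmatched tags).

    Tags are lowercased and whitespace-normalized before lookup;
    duplicates are dropped while first-seen order is kept.
    """
    headings = []
    unmatched = []
    for tag in tags:
        key = " ".join(tag.lower().split())
        heading = authorized.get(key)
        if heading is None:
            # No authorized heading yet: leave for archivist review.
            if key not in unmatched:
                unmatched.append(key)
        elif heading not in headings:
            headings.append(heading)
    return headings, unmatched
```

The unmatched list is the point of collaboration: recurring creator terms found there could be candidates for addition to the shared vocabulary.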
Blogs included in collection acquisitions, partially or fully electronic, can be included in finding aids and/or online catalogs and described according to standards, just as other electronic media: “extent: 4 boxes, 3 floppy disks, 1 blog.”
A collection of blogs offered by an archive or vendor will require different description. Some libraries have used MARC to successfully catalog websites (Tan 2001); the same could be done for blogs in those collections. Blogs would need to be directly or indirectly included in archival description standards, such as Describing Archives: A Content Standard (DACS) and Resource Description and Access (RDA). While the “output-neutral” DACS currently offers description standards for electronic and digitized material, it does not specifically address born digital records like websites and blogs (Society of American Archivists 2007). RDA is theoretically designed with the ability to describe all manner of media; however, it is still in the draft stage and not much is known yet about its practical applications.
 
Preservation
     Preservation of digital data is such a new concept that no definitive conclusions have been set forth by archivists for best practices. As late as 2000, some archivists still printed out hard copies of web pages in the belief that the physical media was more reliable for long-term preservation (Hunter 2000). This is ridiculous. Not only would paper copies exceed physical storage abilities, but the printouts lose the sense of the document in its original context, where hyperlinks and embedded media play a significant role. These additional materials are lost in printed reproductions, instilling intrinsic value (National Archives and Records Administration 1982) in the original web page record.
     Blogs and other webpages actually offer significant storage and preservation potential. Blog archives currently stored on blog software servers like Blogger or LiveJournal can theoretically remain there indefinitely, until the company goes out of business or some other motivating factor shuts down the server. Alexa Internet, creators of the Internet Archive, indicate three areas of preservation concern: accidents, migration, and data format (2007). Accidents, such as natural disasters and equipment malfunctions, are easily combated with backup devices in remote locations. Migration of blogs to new physical storage media is more easily addressed than other types of electronic documents. Born digital blog files are stored on large servers, which may require equipment migration every 30 years (Alexa Internet 2007), but this is a far cry from transferring a document from a floppy disk, to a CD-ROM, to a flash drive, to a hard drive. Blogs run on common, freely available software, and are not limited by program, version, speed, or proprietary software (O’Sullivan 2005). A blog originally built and viewed under Internet Explorer 5.0 can still be read and accessed with IE 7, Firefox 2.0, or any other graphical web browser. So long as a browser can parse the markup language of the blog, HTML, the blog can still be accessed. Blog files can easily be transferred to a new host server, such as a vendor or a repository. Archives should earmark space for this purpose, such as the “virtual archives server” suggested by Hunter (2000).
 
Conclusion
     As technology continues to evolve, archivists will need to address the issues these new technologies raise in terms of further preserving records of cultural heritage. One of the newest significant records necessitating investigation is the blog. The established significance of blogs to society and culture warrants archiving. Now the issue jumps from theory to practice. Traditional archival methods do not address the unique needs the blog presents. Relying solely on new concepts without established precedent and leaving archives completely in the hands of users does not assure long-term retention. A combination of traditional and new archival methods, as well as a union between archivist and user, must be established if this segment of our culture is to survive.
References
 
Alexa Internet (2007). Internet Archive FAQ. Retrieved April 20, 2007, from http://www.archive.org/index.php
 
Blood, R. (2002). The Weblog Handbook. Cambridge, MA: Perseus.
 
Cox, R. (1994). The documentation strategy and archival appraisal principles: a different perspective. In R. C. Jimerson (Ed.), American archival studies: Readings in theory and practice (pp. 211-241). Chicago: Society of American Archivists.
 
Deegan, M. and Tanner, S. (2002). Digital Futures: Strategies for the Information Age. New York: Neal-Schuman.
 
Doctorow, C., et al. (2002). Essential Blogging. Sebastopol, CA: O’Reilly.
 
Fecko, M. B. (1997). Electronic Resources: Access and Issues. London: Bowker-Saur.
 
Hilton, C., and Thompson, D. (2007). Collecting Born Digital Archives at the Wellcome Library. Ariadne (Online) 50(1). Retrieved March 20, 2007, from http://www.ariadne.ac.uk/issue50/hilton-thompson/
 
Hewitt, H. (2005). Blog: Understanding the Information Reformation That’s Changing Your World. Nashville, TN: Thomas Nelson, Inc.
 
Hunter, G. (2000). Preserving Digital Information. New York: Neal-Schuman.
 
Kline, D. and Burstein, D. (2005). Blog!: How the Newest Media Revolution is Changing Politics, Business and Culture. New York: CDS Books.
 
Koltun, L. (1999). The promise and threat of digital options in an archival age. Archivaria 47(1), 114-135.
 
Lyman, P. (2002). Archiving the World Wide Web. In Building a National Strategy for Preservation: Issues in Digital Media Archiving. Washington, DC: Library of Congress. Retrieved April 20, 2007 from http://clir.org/pubs/reports/pub106/web.html
 
Masanes, J. (2005). Web archiving methods and approaches: a comparative study. Library Trends 54(1), 72-90.
 
National Archives and Records Administration. (1982). Intrinsic value in archival materials. In M. F. Daniels and T. Walch (Eds.), A modern archives reader (pp. 91-99). Washington, D.C.: National Archives and Records Administration.
 
National Archives and Records Administration. (1984). Glossary. In M. F. Daniels and T. Walch (Eds.), A modern archives reader (pp. 339-342). Washington, D.C.: National Archives and Records Administration.
 
Ovadia, S. (2006). The need to archive blog content. The Serials Librarian 51(1), 95-102.
 
O’Sullivan, C. (2005). Diaries, on-line diaries, and the future loss to archives. The American Archivist 68(1), 53-73.
 
Peterson, T. H. (1979). The gift and the deed. In M. F. Daniels and T. Walch (Eds.), A modern archives reader (pp. 139-145). Washington, D.C.: National Archives and Records Administration.
 
Schellenberg, T. R. (1956). The appraisal of modern public records. In M. F. Daniels and T. Walch (Eds.), A modern archives reader (pp. 57-70). Washington, D.C.: National Archives and Records Administration.
 
Sifry, D. (2006). State of the Blogosphere. Technorati weblog. Retrieved March 20, 2007, from http://technorati.com/weblog/2006/11/161.html
 
Society of American Archivists (2007). Publication details: Describing Archives: A Content Standard. Retrieved on April 30, 2007, from http://www.archivists.org/catalog/pubDetail.asp?objectID=1279
 
Stone, B. (2003). Blogging: Genius Strategies for Instant Web Content. Berkeley, CA: New Riders.
 
Stone, B. (2004). Who Let the Blogs Out? A Hyperconnected Peek at the World of Weblogs. New York: St. Martin’s.
 
Tan, W. (2001). Cataloging websites for a library online catalogue. Journal of Educational Media & Library Sciences 39(2), 98-105.
 
Wibbels, A. (2006). Blog Wild! A Guide for Small Business Blogging. New York: Penguin.
 
Zelenyj, D. (1999). Archivy ad portas: the archives-records management paradigm re-visited in the electronic information age. Archivaria 47(1), 67-84.

