As the internet has expanded and evolved over the last several decades, websites used by billions of people around the world have become more sophisticated, but also more difficult to save. Dynamic websites, like those that hold data from journalism projects such as COVID-19 maps, are impossible to preserve using conventional tools like The Internet Archive’s Wayback Machine.
To combat this issue, Grand Valley State University alum and current New York University Librarian for Journalism, Media, Culture and Communication, Katy Boss, is working with a team dedicated to preserving the dynamic web.
Boss, along with co-principal investigator Vicky Rampin and lead developers Remi Rampin and Ilya Kreymer, is developing a tool called ReproZip-Web that preserves dynamic web apps and websites. ReproZip-Web is an open-source program that bundles together all the files necessary to run dynamic web apps and saves them as a downloadable .rpz file.
“When you’re thinking about really big data analysis, like astrophysics and those kinds of experiments, to reproduce them you have to have a way to pack them up in order to see what you’re reproducing,” Boss said.
Popular examples of dynamic data journalism projects that ReproZip-Web is built to preserve include “Old Oil Wells” from the “Los Angeles Times,” “Where Harvey’s effects were felt the most in Texas” from “The Texas Tribune” and “Are Hospitals Near Me Ready for Coronavirus?” from “ProPublica.”
Aside from archiving all the background data needed to populate elements like maps and astrophysics experiments, ReproZip-Web also acts as an emulator for dynamic websites. Emulators are typically used to allow programs to run on devices they weren’t originally designed for and are most commonly used for running video games originally designed for consoles on PCs or smartphones.
In this case, ReproZip-Web allows the emulation of dynamic web environments and even discontinued platforms like Adobe Flash, which were the backbone of many popular internet functions in the 2000s and 2010s. ReproZip-Web needs this functionality because those platforms are no longer officially offered for download or updated. When update support for tools like Flash shuts down, it’s usually only a matter of time before the websites built upon them become inaccessible.
“Nobody saved those old browsers that would render Flash, and browsers today are not rendering Flash because it’s no longer supported by Adobe, they’ve sunsetted it,” Boss said. “So it’s issues like that, we don’t really know what the next Flash is going to be.”
Flash’s demise is a perfect snapshot of a problem that’s going to get more common as the pace of internet evolution speeds up. New web development standards are emerging all the time, and as old ones get left behind, so too will the data encased in the websites built upon them. This is why tools like ReproZip-Web are necessary, especially for libraries.
While libraries are known for offering reading material and other ways to access information, they also put a considerable amount of effort into preserving as much as they can. As the vast majority of the world’s information has shifted from print media to digital formats, libraries have been forced to find ways to save not only the data available online, but also the tools needed to access and use it.
“Libraries try and think of the very long term,” Boss said. “50 or a hundred years from now, how are folks going to be accessing the websites of today? That’s really difficult, not knowing what kind of software systems will be around and what we’ll need to support them.”
Trying to keep pace with frequent technological changes and innovations is enough of a problem by itself, but Boss and her team also have to deal with another common problem that makes information preservation difficult: copyright enforcement. At the moment, emulation sits in what Boss calls “a legal gray area,” where emulators themselves are perfectly legal to download and use, but the archived programs that need emulators to run may not be, because of copyright laws.
“Libraries have a unique mandate that is often protected, and we’re partnering with the Software Preservation Network because they’re on the forefront of this,” Boss said. “They have a lawyer on staff who is working with us to try and make emulators more protected.”
Copyright enforcement is also one of the main reasons that ReproZip-Web was designed to be free and open-source from the beginning. Open-source tools typically make their entire codebases available online for anyone to download and make edits to, which, in the case of ReproZip-Web, will allow future users to make sure the program is usable long after Boss and her team move on to other projects.
“After you have to rotate off the project for whatever reason, or when others see something that is impactful in their communities, they can build their own version of it,” Boss said.
Although copyrighted software is an important part of the internet that needs to be archived, Boss’ interest in data preservation originally stemmed from her background as a journalist. She graduated from GVSU in 2005 and wrote for the “Lanthorn” during her time as an undergraduate student.
“Having been a journalist myself, I can empathize so much with these data workers who worked on a project for an entire year and they publish it, and then three to five years later, it’s already broken and gone,” Boss said.
This unique connection between a GVSU alum and a timely digital issue is what led Laurence José, professor and director of the digital studies minor, to invite Boss to speak to her DS 495 class. DS 495 is the digital studies minor’s capstone class and has previously covered contemporary issues like the purpose of emulators and overaggressive data collection.
“Katy’s project on preserving the dynamic web is fascinating and a great way to engage with concepts such as digital preservation, memory and technology,” José said. “We all live, work and create information in these digital spaces and often do not think about what happens when the technology evolves or becomes obsolete.”
More information about ReproZip-Web, including download links and instructions for using the tool, can be found on the project’s GitHub page. The project’s grant proposal, titled “Preserving the Dynamic Web: Building a Production-level Tool to Save Data Journalism and Interactive Scholarship,” is also available to read online.