“Our primary mission is to preserve open source software for future generations. We also intend the GitHub Archive Program to serve as a testament to the importance of the open source community…and encourage (very-)long-term thinking.”-GitHub
GitHub has partnered with the Long Now Foundation, the Internet Archive, Arctic World Archive, Microsoft Research, the Bodleian Library, and others to preserve the world’s open source software. The archive is meant to help prevent a potential “digital dark age” in which contemporary software could be lost over time to “bit rot”.
The GitHub Archive Program aims to protect and preserve knowledge by storing multiple data sets in multiple formats and locations for the next 1,000+ years. Spreading the archive across several organizations and storage media supports very-long-term preservation, an approach known as LOCKSS (Lots Of Copies Keep Stuff Safe). The Archive Program has also adopted the Long Now Foundation’s “pace layering” approach to code archiving, a strategy that maximizes flexibility and durability by combining real-time and very-long-term storage solutions. The Program is organized into three tiers: Hot, Warm, and Cold.
GitHub’s Arctic Code Vault is a data repository stored in the Arctic World Archive (AWA), a very-long-term storage facility located 250 meters deep in the permafrost of a mountain in Svalbard, Norway. The Arctic climate is ideal for very-long-term film storage: the film can remain safe for hundreds of years without electricity or human intervention.
Programmers have until February 2, 2020, to add their source code to the Vault. On that day, GitHub will capture a snapshot of every active public repository, plus significant dormant repositories:
“The snapshot will consist of the HEAD of the default branch of each repository, minus any binaries larger than 100KB in size. Each repository will be packaged as a single TAR file. For greater data density and integrity, most of the data will be stored QR-encoded. A human-readable index and guide will itemize the location of each repository and explain how to recover the data.”-GitHub
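For readers who want a feel for what that packaging step involves, here is a minimal Python sketch under stated assumptions. It is not GitHub’s tooling, which the quote does not describe. The function names `snapshot_to_tar` and `looks_binary` are hypothetical, “100KB” is read as 100,000 bytes, and a file counts as “binary” if its first block contains a NUL byte; all three are assumptions made for illustration. The sketch packs an already checked-out working tree and does not attempt the QR-encoding step.

```python
import os
import tarfile

MAX_BINARY_BYTES = 100 * 1000  # assumption: "100KB" read as 100,000 bytes


def looks_binary(path, probe_size=8192):
    """Heuristic (an assumption): a NUL byte in the first block means binary."""
    with open(path, "rb") as fh:
        return b"\x00" in fh.read(probe_size)


def snapshot_to_tar(tree_dir, tar_path, max_binary_bytes=MAX_BINARY_BYTES):
    """Pack a checked-out tree into one TAR file, skipping large binaries.

    Returns the sorted list of archived paths, relative to tree_dir.
    """
    archived = []
    with tarfile.open(tar_path, "w") as tar:
        for root, dirs, files in os.walk(tree_dir):
            # Only the checked-out files are wanted, not the Git history.
            dirs[:] = sorted(d for d in dirs if d != ".git")
            for name in sorted(files):
                full = os.path.join(root, name)
                if os.path.islink(full) or not os.path.isfile(full):
                    continue  # keep the sketch simple: regular files only
                rel = os.path.relpath(full, tree_dir)
                if os.path.getsize(full) > max_binary_bytes and looks_binary(full):
                    continue  # large binary: left out of the snapshot
                tar.add(full, arcname=rel)
                archived.append(rel)
    return sorted(archived)
```

Calling `snapshot_to_tar("my-project", "my-project.tar")` on a local checkout would write a single TAR file and return the list of files that made it in. Large text files are kept, since the quoted rule only excludes binaries.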
The final 24 terabytes of data will be preserved on film reels coated with iron oxide powder. The reels will then be stored in a steel-walled container within a sealed chamber inside a decommissioned coal mine. This library of data is projected to remain readable for 1,000+ years, by computer or with a magnifying glass. In addition, GitHub plans to archive all public repositories through Microsoft Research’s Project Silica, which uses AI and ultrafast laser optics to encode data in quartz glass. To read the data back, machine learning algorithms will decode the images and patterns created when polarized light passes through the glass. This storage format is projected to remain readable for 10,000+ years.
“…the archive will include technical guides to QR decoding, file formats, character encodings, and other critical metadata so that the raw data can be converted back into source code…The archive will also include a Tech Tree…(that) will serve as a quickstart manual on software development and computing, bundled with a user guide for the archive…(it) will also include information and guidance for applying open source, with context for how we use it today, in case future readers need to rebuild technologies from scratch.”-GitHub