Monday, March 30, 2026

Web preservation (or at least not immediate destruction) and you

This is a bit of a tautology, but most people who spend time debating on the internet like debating. As such, we often see things get meta, with people debating the opinions others held back then, how they were formulated, and how they changed. The problem is that, unlike in other domains, the evolution of people's takes on the internet is very poorly archived, often forcing people to rely on guesses and memories. But why should people trust your memories? Why should you even trust them yourself? The brain doesn't retain things like a hard drive; it keeps the general idea, filtered through your impression in the moment. Anything you read whose every copy has since been deleted may as well never have been read in the first place.

Here is a concrete example: a few years ago, I saw someone on Twitter saying that, like Kingdom Hearts 3, Kingdom Hearts 2 was poorly received on release, and that opinions only began shifting with the release of Final Mix. I wanted to object. Sure, I don't know first-hand what the climate was like at release, but I was already terminally online before the international release of 2.5 HD ReMix, and as far as I remember, Kingdom Hearts 2 was already seen as an even better sequel. But what could I say? I didn't remember where I had read most of this stuff, and of the websites I did remember, most were already offline. The other person told everyone to "check the forums, they are still online", so not the forums I remembered, but the ones that were still around, probably big websites like IGN. This is a silly problem, but I would like to make a point: if you don't archive, then history is written by whoever has enough money to keep paying the bills. And small websites have no obligation to keep paying them just for you.

There is also another thing to point out. You may have noticed a bit of irony in me talking about posts I cannot link while using a tweet I did not link as an example. Well, the problem is that not only is finding old information on most social media really hard, but Twitter recently went through several big exoduses, due to the new owner being an open fascist. Many people deleted their data, not necessarily because they wanted it off the internet, but because they didn't want it benefiting a person they were opposed to. Some didn't even like deleting all of it, but did it on principle.

A more serious case than a bunch of old takes would be a website I found once while trying to read Structure and Interpretation of Computer Programs. It was an extremely complete list of solutions to the book's exercises that didn't just tell you one way to do it, but also contained multiple conversations on ways to approach each problem, which was very useful when stuck, to see if I was close to having an idea of how to solve it. However, this was back when I was mostly driven by my compulsions, so I kept procrastinating on seriously working through the book; after a first period offline, the website closed for good. It probably was never fully archived, and I lost the URL anyway.

What I am trying to get at is that the true cause of the internet's erasure isn't necessarily antipathy, but apathy. Lots of people post stuff without thinking about how solid the underlying database is, writing the same way whether it's on a social network that will archive everything forever or on an obscure forum where everything will go down the day the owner can't pay the bill anymore. They say that "the internet never forgets", but it would be more accurate to say that the internet doesn't forget when it can be used against you; with how the web is structured, consider every page a future 404 error.
It's a bit of a touchy subject with all the concern about privacy, and privacy and archiving are inevitably at odds with each other: the more copies of a piece of data exist, the better it is archived, but the harder it is for someone to exercise their right to be forgotten. It's also worth adding AI to the mix, since some models leech off archives and use the same technology used for archiving to feed themselves. But I think that if there is something you care about on the internet, you can't leave it at the whim of a random CEO who may decide to cut you off through his moderation team, or to sell his company to someone you hate so much you don't want a word of your writing on his website.

So I would like to show the tools I have found to create "extra copies" of webpages. Most of them can be used without being the kind of person who has a multi-terabyte NAS and several crawling scripts running in the background.

The easy combo (history archiver + external archiving service)

Not exactly the most robust thing, but good enough to add an extra layer of safety. The idea is to use one extension that can send pages to a web archiving service, and another to keep your history beyond 90 days so that you keep access to the URLs (because these archiving websites are useless if you don't have the URL...). Of course, this isn't really true archiving, as these services are out of your control and could shut down; it should rather be seen as a way to avoid a single point of failure.

For sending pages to these third-party services, we have the official Wayback Machine extension. It comes with interesting features like the ability to automatically save a page that hasn't been archived in a while, or to save every outlink of a page. But it fails on a lot of heavy pages, so you often have to check whether the page was saved properly.
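As a side note, if you don't want yet another extension, the Wayback Machine's "Save Page Now" feature can also be triggered by hand by prepending its save endpoint to the address you want archived, something like:

https://web.archive.org/save/https://example.com/some-page

(example.com is a placeholder here; visiting that address, or hitting it with curl, asks the Wayback Machine to take a fresh snapshot, at least as far as I understand the service.)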

More morally dubious is archive.today and its unofficial extension, Archive Page. Archive.today is much more powerful than the Wayback Machine, quickly loading webpages that the Wayback Machine would struggle to archive. Archive.today also has the odd property that it can effectively be used to DDoS a site, and it ignores robots.txt, which is very useful for saving forums but can be argued to cross the line on what should be archived. It's ultimately up to you to decide what you value most.
It also has another flaw: it reloads its own page multiple times while archiving a website, which can quickly clog up your history if you use an extension that records every visit. Thankfully, you don't actually need to keep the tab open once the archiving process has begun.

As for history archivers, there are a lot of options out there. History Trends Unlimited and History Plus seem to be the most popular for Chromium, but feel free to look at what else exists. Chrome can show your history in the My Activity section, but on top of the privacy concerns, I find it much slower than a local database, Google might decide to delete your data, and it's easy to accidentally wipe everything while trying to clear the cache. So at least back it up using Google Takeout. It's worth noting that depending on your browser and setup, these extensions can randomly die, but thankfully they have an auto-backup feature, even if it can quickly clog up your download folder. If you're using Safari or Firefox, your extension choices are much more limited, but Firefox can retain history up to an arbitrary limit, and Safari keeps up to six months by default, though that can be changed in the settings. Both only store the most recent visit to a page.
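For Firefox, if the automatic limit is too aggressive for your taste, my understanding is that the cap on the number of pages kept can be overridden through a hidden preference in about:config (assuming recent versions still expose it):

places.history.expiration.max_pages

Set it to something large and Firefox should stop computing its own ceiling based on your hardware; treat this as a pointer to check rather than a guarantee.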

Actually having the page on your machine: SingleFile

SingleFile is an extension that is basically an all-around improvement on your browser's default "save page" option. The two most interesting features of SingleFile, imo, are the network settings and the fact that it captures the page at the moment you see it. The network settings let you choose what kinds of resources to download from the page, which is useful to avoid downloading the same unneeded image files over and over. The results are impressive; pages that would weigh multiple megabytes even when saved as raw HTML with the browser's default download option can come out at only a few kilobytes here. The capture-at-the-moment behaviour means that, for example, if the page has loaded information through an AJAX script or a GET request, SingleFile can capture it without problems, something online archiving services often can't do and that I think sometimes goes beyond even ArchiveBox's capabilities, since the latter works by sending a headless browser to a designated address. To give a concrete example: I sometimes like to read Patreon blog posts about game design. However, some of the most interesting points are in the back and forth of the comment section. Archive.is, which can capture expanded Reddit comments just fine, can't get these in my experience. But once they are loaded, SingleFile treats them like any other text on the page.

SingleFile comes with other features, like the ability to automatically save a copy of each page opened in a tab, or to load a list of URLs and have it archive them all automatically. However, it still has heavy limitations, and since it's an extension, it's stuck clogging your download folder unless you install the companion program, which makes me say that if you want to get into these kinds of advanced uses, you might be better off using ArchiveBox.
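For completeness: SingleFile also has a command-line sibling, single-file-cli, which sidesteps the download-folder problem since it writes wherever you point it. Assuming the package name and basic invocation haven't changed, usage looks roughly like:

npm install -g single-file-cli
single-file https://example.com/some-article saved-article.html

(the URL and output name are placeholders; check the project's README for the current options.)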

Getting videos on the computer: yt-dlp

yt-dlp is a command-line program that can directly download videos from YouTube if you give it the URL. The program is released under a public-domain-style license (the Unlicense), which means that other programs with similar features, like 4K Video Downloader, probably run a version of yt-dlp under a GUI shell. However, yt-dlp doesn't have the strict usage limitations of those programs. What gets really interesting is that yt-dlp also lets you save other types of data, even if your computer doesn't have the space for the full video, thanks to the "--skip-download" option. For example, with the command "yt-dlp --skip-download --write-sub --sub-lang all --sub-format srv3" I was able to quickly download a bunch of subtitles and upload them to the Internet Archive back when everybody was panicking because it seemed that YouTube was deleting all custom captions. You can also save an entire playlist by passing its URL as the argument, and download comments with the --write-comments option.
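To give a rough idea of how these options combine, here is what grabbing the comments and subtitles of an entire playlist, without the videos themselves, could look like (the playlist URL is a placeholder):

yt-dlp --skip-download --write-comments --write-sub --sub-lang all "https://www.youtube.com/playlist?list=PLxxxxxxxx"

If I remember the behaviour correctly, the comments end up inside the .info.json file written for each video rather than in a separate file.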

Backing up Discord conversations

I don't think it's really a mystery that Discord seems to be getting worse and that a lot of people want to leave. However, a lot of guides and information have been written on these servers, and many people are frustrated at the idea of leaving behind years of conversations.

The first and most morally unimpeachable tool is the official request for a copy of your personal data. Among other things, this will send you a list of all the messages you have sent across every Discord channel. It isn't the most human-friendly version of the data, but at least it's there on your computer if you get banned or the server gets deleted. This option only saves your own messages and doesn't include the surrounding media, so you will be missing a lot of context.

In the unofficial tool category, the Discrub extension lets you download the content of the channels of your choice over a selected period, with other filters available. It takes a bit of time (especially if you use it to fetch reactions) and it can sometimes freeze during the process, so I would recommend working in chunks that each cover a certain period, but outside of that it works really well. The program can export to HTML for directly readable files, but also to CSV and JSON to allow further manipulation later on. The CSV export also seems to store more data than the HTML one. I also briefly tested DiscordChatExporter, which probably eats less RAM since it works as a standalone program, but I remember that things like quickly selecting all channels except one or two and saving them over a defined period were much more annoying to do than with Discrub, so I didn't use it much.
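As a quick illustration of why the JSON exports are handy for later manipulation, here is a hedged one-liner, assuming a DiscordChatExporter-style file where everything lives under a top-level "messages" array (Discrub's layout may differ, so check the file first):

jq '.messages | length' my-channel.json

This just counts the messages in the export, but the same approach works for filtering by author, date, and so on.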
Discord History Tracker, meanwhile, works more as a tool to capture whatever your current Discord session is looking at. It can in theory be used to archive old messages, but you're going to be there for a while if you ask the auto-scroll to deal with channels that are a decade old. Unlike Discrub, it can't fetch reaction data beyond the number of people who reacted, to my knowledge. It also stores media in a database and only offers to "clean" everything at once, so I think it's worse in that regard than Discrub, where you can target the heavier files if you want to reclaim space.

This is worth repeating as a warning: downloading Discord conversations isn't that hard. If you run a periodically purged channel, giving people a 24-hour warning is basically an invitation to launch a scraping program. Deleting only stops the people who come looking for the data afterwards.

The most advanced solution: ArchiveBox

ArchiveBox basically aims to be an all-in-one solution for internet archiving. It uses several of the tools in this list, like SingleFile and yt-dlp, to keep a copy of every URL you feed it. I have a Docker container installed on my Mac out of curiosity, but I find the setup a bit "heavy", especially with its emphasis on saving in multiple formats (even if you can pick which ones you want to use). If you want to save a lot of URLs and have disk space to spare, it's probably a tool worth looking into.
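For reference, the Docker setup I mean is roughly the following, assuming the image name and subcommands haven't changed since I tried it (the URL is a placeholder):

mkdir archivebox && cd archivebox
docker run -v $PWD:/data -it archivebox/archivebox init
docker run -v $PWD:/data -it archivebox/archivebox add 'https://example.com/some-page'

Every URL you add gets its own folder containing the different save formats, and there is also a web interface if you run the server command; check the official documentation for the current instructions.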

This is very creepy and I would like to not be archived

If you want to minimize the risk of your data existing in multiple places, the easiest way is to set your social media accounts so that only logged-in users can see them in the first place. This kills most online archiving tools, and can even trip up page-downloading tools. For example, ArchiveBox by default sends a headless browser without any cookies or login data, and in the application's current state it's recommended to use a burner account, because the login information can be recovered from the archives if the settings to log in to websites are enabled. There isn't much you can do to stop people from downloading your data if they really want to, however.

