This is something that keeps me up at night. Unlike other historical artefacts like pottery, vellum manuscripts, or stone tablets, information on the Internet can just blink into nonexistence when the server hosting it goes offline. This makes it difficult for future anthropologists who want to study our history and document the different Internet epochs. For my part, I always try to send any news article I see to an archival site (like archive.ph) to help collectively preserve our present so it can still be seen by others in the future.

  • lloram239@feddit.de · 3 years ago

    Ultimately this is a problem that’s never going away until we replace URLs. The HTTP approach of finding documents by URL, i.e. server/path, is fundamentally brittle. It doesn’t matter how careful you are or how much best practice you follow: that URL is going to be dead in a few years. The problem is made worse by DNS, which makes domains expensive and lets them expire, taking every URL under them along.

    There are approaches like IPFS, which uses content-based addressing (i.e. fancy file hashes), but that’s not enough either, as it provides no good way to update a resource.

    The best™ solution would be some kind of global blockchain-like thing that keeps a record of what people publish, giving each document a unique id, a hash, and some way to update that resource non-destructively (i.e. the version history is preserved). Hosting itself would still need to be done by other parties, but a global log file listing all the stuff humans have published would make it much easier and more reliable to mirror it.

    The end result should be “Internet as globally distributed immutable data structure”.

    Bit frustrating that this whole problem isn’t getting the attention it deserves.

    • Lucien@beehaw.org · 3 years ago

      I don’t think this will ever happen. The web is more than a network of changing documents. It’s a network of portals into systems which change state based on who is looking at them and what they do.

      In order for something like this to work, you’d need to determine what the “official” view of any given document is, but the reality is that most documents are generated on the spot from many sources of data. And they aren’t just generated on the spot; they’re Turing-complete documents which change themselves over time.

      It’s a bit of a quantum problem - you can’t perfectly store a document while also allowing it to change, and the change in many cases is what gives it value.

      Snapshots, distributed storage, and change feeds only work for static documents. Archive.org does this, and while you could probably improve the fidelity or efficiency, you won’t be able to change the underlying nature of what it is storing.

      If all of Reddit were deleted, it would definitely be useful to have a publicly archived snapshot of it. Doing so is definitely possible, particularly if they decide to cooperate with archival efforts. On the other hand, you can’t preserve all of the value by simply making a snapshot of the static content available.

      All that said, if we limit ourselves to static documents, you still need to convince everyone to take part. That takes time and money away from productive pursuits such as actually creating content, to solve something which honestly doesn’t matter to the creator. It’s a solution to a problem which solely affects people accessing information after those who created it are no longer in a position to care about said information, with deep tradeoffs in efficiency, accessibility, and cost at the time of creation. You’d never get enough people to agree to it that it would make a difference.

      • lloram239@feddit.de · 3 years ago

        but the reality is that most documents are generated on the spot from many sources of data.

        That’s only true due to the way the current Web (d)evolved into a bunch of apps rendered in HTML. But there is fundamentally no reason why it should be that way. The actual data that drives the Web is mostly completely static. The videos YouTube has on their servers don’t change. Posts on Reddit very rarely change. Twitter posts don’t change either. The dynamic parts of the Web are the UI and the ads; they might change on each and every access, or be different for different users, but they aren’t the parts you want to link to anyway. You want to link to a specific user’s comment, not that comment rendered in a specific version of the Reddit UI with whatever ads were on display that day.

        Usenet got this (almost) right 40 years ago: each message got a message-id, and each reply contained its parent’s id in a header. This is why large chunks of Usenet could be restored from tape archives and put back together; the way content linked to other content didn’t depend on a storage location. It wasn’t perfect, of course: it had no cryptography going on and depended completely on users behaving nicely.
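
        As a rough sketch of why location-independent ids make that reassembly possible: given nothing but an unordered pile of messages, the parent ids alone are enough to rebuild the thread. (The message format here is made up for illustration, not the actual Usenet header layout.)

```python
# An unordered pile of messages, as if restored from scattered tapes.
messages = [
    {"id": "<c@example>", "parent": "<b@example>", "body": "grandchild"},
    {"id": "<a@example>", "parent": None,          "body": "root post"},
    {"id": "<b@example>", "parent": "<a@example>", "body": "reply"},
]

def rebuild_thread(msgs):
    """Rebuild the reply tree purely from message-ids; no storage
    locations are involved at any point."""
    by_parent = {}
    for m in msgs:
        by_parent.setdefault(m["parent"], []).append(m)

    def walk(parent_id, depth=0):
        for m in by_parent.get(parent_id, []):
            yield depth, m
            yield from walk(m["id"], depth + 1)

    return list(walk(None))

thread = rebuild_thread(messages)  # root first, replies nested below it
```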

        Doing so is definitely possible, particularly if they decide to cooperate with archival efforts.

        No, that’s the problem with URLs. This is not possible. The domain reddit.com belongs to a company and they control what gets shown when you access it. You can make your own reddit-archive.org, but that’s not going to fix the millions of links that point to reddit.com and are now all 404.

        All that said, if we limit ourselves to static documents, you still need to convince everyone to take part.

        The software world operates in large part on Git, which already does most of this. What’s missing there is some kind of DHT to automatically look up content. It’s also not all-or-nothing; take the Fediverse: the idea of distributing content is already there, but the URLs are garbage, like:

        https://beehaw.org/comment/291402

        What’s 291402? Why is the id 854874 when accessing the same post through feddit.de? Those are storage-location implementation details leaking out into the public. That really shouldn’t happen; it should be a globally unique content hash or a UUID.
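
        As an illustration, an instance-independent id could be derived from the comment itself, so that beehaw.org and feddit.de would compute the same id for the same post. (The fields and canonicalization here are made up; no actual Fediverse software works this way.)

```python
import hashlib
import json

def content_id(author: str, created: str, body: str) -> str:
    """Derive a stable id from the content itself, not a database row."""
    canonical = json.dumps(
        {"author": author, "body": body, "created": created},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

cid = content_id("someuser@feddit.de", "2023-06-14T12:00:00Z", "hello")
# Every instance holding the same fields computes the same cid.
```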

        When you have a real content hash you can do fun stuff. In IPFS URLs, for example:

        https://ipfs.io/ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf

        The /ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf part is server-independent; you can access the same document via:

        https://dweb.link/ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf

        or even just view it on your local machine directly via the filesystem, without manually downloading:

        $ acrobat /ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf

        There are a whole lot of possibilities that open up when you have better names for content, having links on the Web that don’t go 404 is just the start.

        • soiling@beehaw.org · 3 years ago

          re: static content

          How does authentication factor into this? Even if we exclude marketing/tracking bullshit, there is a very real concern on many sites about people seeing only the data they’re allowed to see. There are even legal requirements. If that data (such as health records) is statically held in a blockchain such that anyone can access it by its hash, privacy evaporates, doesn’t it?

          • lloram239@feddit.de · 3 years ago

            How does authentication factor into this?

            That’s where it gets complicated. Git sidesteps the problem by simply being a file format; the downloading still happens over regular old HTTP, so you can apply all the same restrictions as on a regular website. IPFS, on the other hand, ignores the problem and assumes all data is redistributable and accessible to everybody. I find that approach rather problematic and short-sighted, as that’s just not how copyright and licensing work. Even data that is freely redistributable needs to declare so; otherwise the default fallback is copyright, which doesn’t allow redistribution unless explicitly permitted. IPFS so far has no way to tag data with a license, author, etc. LBRY (the thing behind Odysee.com) should handle that a bit better, though I am not sure of the details.

      • LewsTherinTelescope@beehaw.org · 3 years ago

        The inability to edit or delete anything also has a lot of fundamental problems of its own. Accidentally post a picture with a piece of mail in the background and catch it a second after sending? Too late, anyone who looks now has your home address. Child shares too much online and a parent wants to undo that? No can do, it’s there forever now. Post a link and later learn it was misinformation and want to take it down? Sucks to be you, or anyone else that sees it. Your ex posts revenge porn? Just gotta live with it for the rest of time.

        There’s always a risk of that when posting anything online, but that doesn’t mean systems should be designed to lean into that by default.

    • Corhen@beehaw.org · 3 years ago

      Even beyond what you said: even if we had a global blockchain-based browsing system, that wouldn’t make it easier to keep the content ONLINE. If a website goes offline, the knowledge and reference are still lost, and whether it’s a URL or a blockchain, it would still point towards a dead resource.

      • lloram239@feddit.de · 3 years ago

        It would make it much easier to keep content online, as everybody could mirror content with close to zero effort. That’s quite the opposite of today, where mirroring is essentially impossible: all the links still refer to the original source and still turn into 404s when that source goes down. The fact that the file might still exist on another server is largely meaningless when you have no easy way to discover it and no way to tell if it is even the right file.
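
        The key property is that a content-addressed link lets anyone verify a mirror without trusting the host, which is exactly what a plain URL can’t do. A minimal sketch:

```python
import hashlib

def verify_mirror(expected_id: str, mirrored_bytes: bytes) -> bool:
    """A mirror is valid iff its bytes hash to the id in the link."""
    return hashlib.sha256(mirrored_bytes).hexdigest() == expected_id

original = b"some archived page"
link_id = hashlib.sha256(original).hexdigest()  # the id carried in links

assert verify_mirror(link_id, original)         # faithful copy
assert not verify_mirror(link_id, b"tampered")  # wrong or corrupted file
```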

        The problem we have today is not storage, but locating the data.

        • FuckFashMods@lib.lgbt · 3 years ago

          Why would people mirror somebody else’s stuff?

          Maybe you’d personally mirror a small number of things you found interesting, but I don’t see that happening at any wide scale.

  • Hedup@lemm.ee · 3 years ago

    I don’t think it’s a problem. If everything, or most, of the internet were somehow preserved, future anthropologists would have exponentially more material to go through, which would be impossible to cover unless the number of anthropologists also grew exponentially, similarly to how the internet does. But there’s a catch: if the number of anthropologists grows exponentially, it’s because the overall human population grows exponentially, and if the human population grows exponentially, then the content it produces on the internet grows even faster.

    You see, the content on the internet will always grow faster than the discipline of anthropology can keep up with. And it’s nothing new: think about all the lost “history” that was never preserved and that we don’t know about. The good news is that the most important things will be preserved naturally.

    • soiling@beehaw.org · 3 years ago

      the most important things will be preserved naturally.

      I believe this is a fallacy. Things get preserved haphazardly or randomly, and “importance” is relative anyway.

      • fckgwrhqq2yxrkt@beehaw.org · 3 years ago

        In addition, who decides “importance”? Currently importance seems very tied to profitability, and knowledge is often not profitable.

      • CanadaPlus · 3 years ago

        It is relative, but it only takes one chain of transmission.

        AskHistorians on Reddit had an answer about this. Stuff is flimsy but also really easy and cheap to make copies of now.

  • strainedl0ve@beehaw.org · 3 years ago

    This is a very good point and one that is not discussed enough. Archive.org is doing amazing work but there is absolutely not enough of that and they have very limited resources.

    The whole internet is extremely ephemeral, more than people realize, and that’s concerning in my opinion. Funnily enough, I actually think that federation/decentralization might be the solution: a distributed system for backing up the internet, which anyone can contribute storage and bandwidth to, might be the only sustainable approach. I wonder if anyone has thought about it already.

    • entropicdrift · 3 years ago

      I’d argue that decentralization can help or hurt, depending on how it’s handled. If most sites cache or back up data found elsewhere, that’s good both for resilience and for preservation; but if the data in question stays centralized on its home server, then instead of backing up one site we’re stuck backing up a thousand, not to mention the potential issues with discovery.

  • Rentlar@beehaw.org · 3 years ago

    Well, stone tablets, writing, songs, and culture can all disappear with time, either naturally (through erosion and weather) or through human action (such as burning books or the destructive excavation of ancient artifacts/ruins).

    That’s why we try to keep good records.

  • old-tymon@lemmy.one · 3 years ago

    Remember a few years ago when MySpace did a faceplant during a server migration, and lost literally every single piece of music that had ever been uploaded? It was one of the single-largest losses of Internet history and it’s just… not talked about at all anymore.

  • aard@kyu.de · 3 years ago

    Another problem is that even if sites and their content stay up, they often reorganize it for various reasons - often by importing old content into some new platform - and don’t care about the URLs the content was available at, which breaks all links to it.

    Some pages at least try to show you a page with suggestions what you might’ve been going for, but I’ve also seen those less and less over the years.

    For my stuff I’ve been making sure links keep working for over two decades now. On my personal page you can still access everything via the old forms like /cgi-bin/script.cgi?page, even though that script, and the cgi-bin directory as a whole, have been gone for over a decade. But I seem to be pretty alone in my efforts to keep things at stable locations.

    edit: I just noticed matrix.org broke all links coming from Google search, at least for bridges. They should’ve known better.

  • m00njuic3@kbin.social · 3 years ago

    Thankfully we do have people trying to archive things. Sadly, not everything will make it into those archives; there’s just too much new stuff all the time to keep up with. But hopefully we can keep the most important stuff.

  • altz3r0@beehaw.org · 3 years ago

    I think preservation is happening, the issue lies in accessibility. Projects like Archive.org are the public ones, but it is certain that private organizations are doing the same, just not making it public.

    This is also my biggest worry about the Fediverse. It has tools to deal with this, but they are self-contained. As far as I’ve seen, no search engine is crawling the Fediverse, and no initiative to archive, index, and generally make its content accessible is currently in place, and that’s a big risk. I’m sure we will soon be seeing loss of information for this reason, if it hasn’t happened already.

    • Dee@beehaw.org · 3 years ago

      It’s still fairly new, I’m confident we’ll see fediverse crawlers before too long. Especially with all the attention it’s getting and more developers turning their interests here. I also saw some talk about instance mirroring that would allow backups should an instance go down. Things are in motion.

      Absolutely a problem at the moment but I’m not too worried for the future tbh.

  • thejml@lemm.ee · 3 years ago

    It’s important here to think about a few large issues with this data.

    First, data storage. Other people here are talking about decentralizing and creating fully redundant arrays so multiple copies are always online and can be easily migrated from one storage technology to the next. There’s a lot of work here, not just in getting all the data, but in making sure it continues to move forward as we develop new technologies and storage techniques. This won’t be a cheap endeavor, but it’s one we should try to keep up with. Hard drives die; bit rot happens. Even powered off, a spinning drive will fail with time, as will an SSD. CDs I burned 15+ years ago aren’t 100% readable anymore.
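
    Bit rot at least is detectable with a checksum manifest: hash everything once, re-hash periodically, and restore anything that no longer matches. A minimal sketch (the file contents are stand-ins):

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Record a checksum for every stored object when it is archived.
archive = {"page.html": b"<html>hello</html>", "photo.jpg": b"\xff\xd8\xff"}
manifest = {name: checksum(data) for name, data in archive.items()}

# Later: one object silently rots on disk.
archive["photo.jpg"] = b"\x00\x00\x00"
damaged = [n for n, d in archive.items() if checksum(d) != manifest[n]]
# damaged now lists exactly the objects to restore from another copy.
```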

    Second, data organization. How can you find what you want later when all you have are images of systems, backups of databases, and static flat files of websites? A lot of sites now require JavaScript and other browser machinery just to view or use the site. If all you have is a flat file with a bunch of rendered HTML, can you really still find the page you want? Search boxes won’t work, and API calls will fail without the real site up and running. Databases have to be restored to be queried, and if they’re relational, who will know how to connect the dots?

    Third, formats. Similar to the previous point: what happens when JPG is deprecated in favor of something better? Can you currently open a file you wrote in 1985? Will there still be a program available to decode it? We’ll have to back those programs up as well, along with the OSes they run on. And if there are no processors left they can run on, we’ll need emulators. Standards are obviously great here; we may not forget how to read a PCX or GIF or JPG file for a while, but more niche things will definitely fall by the wayside.

    Fourth, timescale. Can we keep this stuff for 50 years? 100? 1000? What happens when our great*30-grandchildren want to find this info? We regularly find things from a few thousand years ago here on earth in archaeological digs. There’s a difference between backing something up for use in a few months and for use in a few years, let alone a few hundred or thousand, when data storage, processors, and displays will all be vastly different. Or what happens in a Horizon Zero Dawn scenario, where all the secrets are locked up in a vault of technology left to rot that no one knows how to use, because we’ve nuked ourselves into regression?

  • DeGandalf@kbin.social · 3 years ago

    In this respect, the internet is closer to spoken language than to any written medium. Even if you use a service to archive the things you find, it’s still possible that it shuts down too.

  • kool_newt@beehaw.org · 3 years ago

    Capitalism has no interest in preservation except where it is profitable, and thinking about the long-term future or future archaeologists’ success, and acting on it, is not profitable.

    • FuckFashMods@lib.lgbt · 3 years ago

      It’s not just capitalism lol

      Preserving things costs money/resources/time. This happens in a lot of societies.

      • kool_newt@beehaw.org · 3 years ago

        And a non-capitalist society could decide to invest resources into preservation even if it’s not profitable.

          • PM_ME_VINTAGE_30S@vlemmy.net · 3 years ago

            Could it? Yeah, sure it could, and in some cases it will, but only if someone up the chain thinks it’s profitable. Profit motive should never dictate how archaeology is practiced.

  • CynAq@kbin.social · 3 years ago

    We need deliberate efforts to archive everything efficiently.

    We also need a way to decouple everyone’s personal info from publicly available information about them, keeping in mind that not all publicly available information is intended to be that way.

    Storage ain’t cheap and it definitely ain’t infinite.

    This is a way harder problem than “the internet” being a bit more mindful can solve easily.

    Not to absolve any companies from responsibility or anything.

    • Trainguyrom@reddthat.com · 3 years ago

      We also need a way to decouple everyone’s personal info from publicly available information about them, keeping in mind that not all publicly available information is intended to be that way.

      Here’s a crazy idea: what if personal information became publicly available something like a century or two after the person’s death? How cool would genealogy be if you could know more about those vague people from two centuries ago than just “this is Bob, he was born on this date, married on this date, had kids on these dates, and died on this date. Oh, and here’s a single photo that could easily have been misidentified”?
