Hey Linux community,

I’m struggling with a file management issue and hoping you can help. I have a large media collection spread across multiple external hard drives. Often, when I’m looking for a specific file, I can’t remember which drive it’s on.

I’m looking for a file indexing and search tool that meets the following requirements:

  • Ability to scan multiple locations
  • Option to exclude specific folders or subfolders from both scan and search
  • File indexing for quicker searches
  • Capability to search indexed files even when the original drive is disconnected
  • Real-time updates as files change

Any recommendations for tools that meet most or all of these criteria? It would be a huge help in organizing and finding my media files.

Thanks in advance for any suggestions!

  • @ssm · -2 points · 1 month ago

    Ability to scan multiple locations

    find /path/one /path/two [expression]
    

    Option to exclude specific folders or subfolders from both scan and search

    # -prune keeps find from descending into the excluded directories
    find /some/path \( -name exclusion1 -o -name exclusion2 ... \) -prune -o [expression] -print
    

    File indexing for quicker searches

    Not indexing, but you can make find faster through parallelization if your xargs has the -P extension.

    # find -print0 is an extension that separates the files it finds with '\0'
    # xargs -0 is an extension that splits input on '\0' instead of whitespace and newlines
    # xargs -P _x_ is an extension that runs up to _x_ invocations of _utility_ in parallel instead of serially
    find /some/path [expression] -print0 | xargs -0P$(command_to_get_cpu_threads) _utility_ [args]
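
    For example, on GNU/Linux, where nproc reports the CPU count (the path, glob, and use of grep here are only placeholders, not something OP asked for):

    find /mnt/media -type f -name '*.srt' -print0 | xargs -0 -P"$(nproc)" grep -l 'pattern'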
    

    Capability to search indexed files even when the original drive is disconnected

    I don’t know what the use case for this is, but you could create a cron script that periodically dumps the names of the files at a mount point to a path like ~/var/log/something, or use a domain-specific unmount script that dumps the paths before unmounting.
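
    A rough sketch of the unmount-script variant (the mount point and log directory are made up):

    #!/bin/sh
    # Dump a listing of the files on a drive before unmounting it, so the
    # listing can still be grepped while the drive is unplugged.
    MOUNTPOINT=/mnt/media1                      # hypothetical mount point
    LOGDIR="$HOME/var/log/drive-indexes"        # hypothetical log location
    mkdir -p "$LOGDIR"
    find "$MOUNTPOINT" -type f > "$LOGDIR/$(basename "$MOUNTPOINT").files"
    umount "$MOUNTPOINT"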

    Real-time updates as files change

    Would require a non-portable script that stores each file’s mtime in an array and, in a loop, compares the old mtime against the new one using stat. Maybe implement it as a daemon.
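
    A rough sketch of that polling approach (bash plus GNU stat, hence non-portable; the path and interval are arbitrary):

    #!/usr/bin/env bash
    # Poll a tree, remember each file's mtime, and report files whose mtime
    # changed since the previous pass.
    WATCHDIR=${1:-/some/path}
    declare -A mtimes
    while :; do
        while IFS= read -r -d '' f; do
            new=$(stat -c %Y "$f")
            if [ -n "${mtimes[$f]:-}" ] && [ "${mtimes[$f]}" != "$new" ]; then
                echo "changed: $f"
            fi
            mtimes[$f]=$new
        done < <(find "$WATCHDIR" -type f -print0)
        sleep 60
    done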

    • @solrize@lemmy.world · 3 points · 1 month ago

      [search indexed files that are offline] One would hope this is not possible.

      I think the idea is to store the search index in a separate place from the file. For indexing text though, I’ve found that the index is comparable in size to the file itself. It’s not entirely clear to me what OP wants to search. Something like email? Obviously if it’s just metadata for media files (kilobyte text description of a gigabyte video) then the search index can be tiny.

      Real-time updates as files change

      Would require a non-portable script that stores each file’s mtime in an array and, in a loop, compares the old mtime against the new one using stat. Maybe implement it as a daemon.

      That is what inotify is for.
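
      For example, with inotifywait from inotify-tools (the path is just an illustration):

      # Recursively watch a tree and print one line per event as files are
      # created, modified, moved, or deleted; an indexer can consume this
      # stream instead of polling mtimes.
      inotifywait -m -r -e create,modify,move,delete /srv/media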

      I realize your overall answer was mostly snark, but the problems mentioned really do take some work to solve. For example, if you want to index email, you want the indexer to understand email headers so it can do the right things with the timestamps and other fields. You can’t just chuck everything into a big generic search engine and press “blend”.

      I will mention git-annex, which is aimed at a somewhat different problem, but it can help you manually keep track of where your offline files are, more or less.
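
      A very rough sketch of that workflow (the repo names, paths, and remote setup are invented; the git-annex docs cover the real details):

      # On each external drive, make the media directory a git-annex repo;
      # where each file's content lives is then tracked in the git metadata.
      cd /mnt/drive1/media
      git init && git annex init "drive1"
      git annex add . && git commit -m "add media"

      # From an always-available clone that has the drives as remotes and has
      # synced with them, ask which drive holds a file while it is unplugged:
      git annex whereis path/to/video.mkv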

      • @ssm · 1 point · 1 month ago

        Sorry, I have .world blocked, so I didn’t see your reply until now (I wish I could block instances without also blocking replies from them, but whatever).

        It’s not entirely clear to me what OP wants to search. Something like email? Obviously if it’s just metadata for media files (kilobyte text description of a gigabyte video) then the search index can be tiny.

        Yeah, I amended my post earlier to recommend logging with a domain-specific unmount script, but I don’t know why they want to do this.

        I realize your overall answer was mostly snark

        Apparently I’m so good at trolling I troll people even when I’m not trying to troll. :<

        This is what inotify is for

        If inotify works for you, that’s fine. I don’t have any experience with it; maybe I’ll look into it after this, if the use case ever comes up.

        You can’t just chuck everything into a big generic search engine and press “blend”

        Eh, regexes (EREs) are good enough for 99% of use cases, honestly. For the other 1%, consider using an easier-to-parse file format.
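
        For instance, against a dumped listing of filenames (the path and pattern are invented):

        # Find episode files by a season/episode pattern in a saved listing:
        grep -E 'S[0-9]{2}E[0-9]{2}.*\.(mkv|mp4)$' ~/var/log/drive1.files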

        • @solrize@lemmy.world · 1 point · 1 month ago

          Yeah, I amended my post earlier to recommend logging with a domain-specific unmount script, but I don’t know why they want to do this.

          They have umpty jillion terabytes of video on a shelf full of external HDDs, and they want to know which files are on which drives. In the old days we had racks full of mag tapes and had the same issue. It’s not a new problem.

          For info about inotify, try a web search.

          For text search, you start needing real indexing once you’re over maybe a GB of text. Before that, you can live with grep or SQL tables or whatever.