Ferret File System/app

From Woozle Writes Code
Revision as of 14:37, 25 February 2024

FFS as an application

It seems like it should be doable to implement FFS as a userland/desktop application rather than a core filesystem feature. This should make it possible to gain the benefits of these services without investing as much development time, allowing them to be tested and refined before being attempted as a kernel-level service.

Description

  • There is a database of all files in one or more folders (could be the entire filesystem, or just selected folders). I'll call these "source folders".
  • Each record stores all the directory info found for each corresponding file, along with a hash of the contents.
    • The hash should be relatively large, to minimize the chances of a hash collision (which could result in mistakenly deleting all copies of a file).
  • You also give FFS one or more folders that it can control, i.e. store stuff however it wants -- doesn't have to be human-friendly. I'll call these "archive spaces".
    • Ideally, it would have areas on multiple drives -- one for every drive that you're using to archive stuff -- and ultimately in different physical locations, to allow for disaster recovery.
  • You can mark each source folder as either left alone or "managed" -- meaning, e.g., that we want to remove unnecessary identical copies, especially of larger files, and just keep a record that the file was there (maybe replace it with a link, maybe don't bother).
  • You can also tag files/folders (either individually or by various criteria) with various levels of criticality -- basically, how many backup copies do we want?
  • FFS then looks for redundant files and removes them. If we want more copies than currently exist, it will make a copy in one or more of its archive spaces.
  • Once we have everything indexed and deduplicated, we can start using FFS to manage new content. Instead of trying to organize it just with folders and naming, we can tell FFS to manage a given set of new files. It'll check for duplication, make sure that the requisite number of copies is made, and put them wherever there's room. Manual organization is done via searchable semantic data (which can be viewed as a folder tree if desired).
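
The configuration described above -- source folders, archive spaces, and criticality levels -- could be modeled something like this. This is a minimal Python sketch; all class and field names are hypothetical, not taken from any existing FFS code:

```python
from dataclasses import dataclass, field

# Hypothetical FFS configuration records; names are illustrative only.

@dataclass
class SourceFolder:
    path: str
    managed: bool = False   # True: FFS may remove/replace redundant copies
    criticality: int = 1    # desired number of backup copies

@dataclass
class ArchiveSpace:
    path: str
    volume_id: str          # copies of one file should land on distinct volumes

@dataclass
class FfsConfig:
    sources: list = field(default_factory=list)
    archives: list = field(default_factory=list)

config = FfsConfig(
    sources=[SourceFolder("/home/user/photos", managed=True, criticality=3)],
    archives=[ArchiveSpace("/mnt/backup1/ffs", "vol-1"),
              ArchiveSpace("/mnt/backup2/ffs", "vol-2")],
)
```

Keeping a `volume_id` per archive space matters because two copies on the same physical drive don't protect against that drive failing.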

Process

Part 1

The FFS app scans a user-specified set of folders, recording in a database every file found and logging any changes noted. Each file-record would include:

  • full path/name
  • timestamps
  • file size
  • file ownership, system attributes, etc.
  • contents fingerprint (a hash, treated in practice as unique)
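
A minimal sketch of how one such record might be gathered in Python: a single stat() call for the directory fields, plus a chunked SHA-256 digest for the fingerprint (256 bits keeps the collision risk negligible, per the note above). Function names are illustrative:

```python
import hashlib
import os

def fingerprint(path, chunk_size=1 << 20):
    """SHA-256 of the file contents, read in chunks so large files
    never have to fit in memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def file_record(path):
    """Collect the per-file fields listed above from one stat() call."""
    st = os.stat(path)
    return {
        "path": os.path.abspath(path),
        "mtime": st.st_mtime,    # timestamps
        "ctime": st.st_ctime,
        "size": st.st_size,
        "uid": st.st_uid,        # ownership
        "mode": st.st_mode,      # system attributes
        "hash": fingerprint(path),
    }
```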

During each scan, file entries would be checked against the DB by both hash and filespec, to identify changed contents or renaming. Every file found would be logged, along with any changes noted (i.e. anytime there is a partial match with an existing file, the current file will be logged as a modification of that file).
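
The matching step might classify each scanned file along these lines -- a hypothetical sketch using in-memory indexes keyed by path and by hash, standing in for the real DB lookups:

```python
def classify(record, by_path, by_hash):
    """Compare a freshly scanned record against the DB indexes.

    by_path maps path -> stored hash; by_hash maps hash -> set of paths.
    Returns one of: "unchanged", "modified", "renamed", "new".
    """
    old_hash = by_path.get(record["path"])
    if old_hash == record["hash"]:
        return "unchanged"    # full match on both path and hash
    if old_hash is not None:
        return "modified"     # same path, different contents
    if record["hash"] in by_hash:
        return "renamed"      # same contents under a new path
    return "new"              # no partial match at all
```

The two "partial match" cases ("modified" and "renamed") are the ones the scan log records as modifications of an existing file.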

There would also be semantic data tables, to allow users to enter arbitrary information about each file as well as a set of criticality settings (e.g. how many copies should there be, to minimize the chance of losing it in a hardware failure).
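
One possible shape for those tables, sketched as a SQLite schema via Python's sqlite3 module. Table and column names are assumptions, not from the original design:

```python
import sqlite3

# Hypothetical schema: one row per file, free-form semantic key/value
# pairs, and a per-file criticality (desired copy count).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    id    INTEGER PRIMARY KEY,
    path  TEXT UNIQUE,
    hash  TEXT,
    size  INTEGER
);
CREATE TABLE semantic (
    file_id INTEGER REFERENCES files(id),
    key     TEXT,   -- arbitrary user-supplied field, e.g. "project"
    value   TEXT
);
CREATE TABLE criticality (
    file_id       INTEGER REFERENCES files(id),
    copies_wanted INTEGER DEFAULT 2   -- minimum copies, on separate volumes
);
""")
```

A key/value `semantic` table keeps the user-entered data arbitrary, as described, without schema changes per field; a folder-tree view would just be one way of querying it.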

Part 2

FFS checks for duplicate files, and ensures that duplication for each file is within the limits described by the user -- that is, files with an insufficient number of backups will have additional copies made (on separate volumes), and over-duplicated files will be winnowed (either replaced with links or deleted, depending on various conditions).
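
A rough sketch of that check in Python: group records by content hash, count copies on distinct volumes, and decide whether to add copies or winnow the surplus. The names and the exact keep/winnow policy here are illustrative, not a definitive implementation:

```python
from collections import defaultdict

def plan_actions(records, copies_wanted):
    """Decide per content-hash whether duplication is within limits.

    records: list of {"path", "hash", "volume"} dicts from the scan DB.
    copies_wanted: hash -> desired number of copies (default 2).
    Returns hash -> ("copy", n) to add n copies on separate volumes,
    ("winnow", [paths]) to link-or-delete surplus, or ("ok", None).
    """
    by_hash = defaultdict(list)
    for r in records:
        by_hash[r["hash"]].append(r)

    plan = {}
    for h, copies in by_hash.items():
        # Copies on the same volume don't count twice: they'd be lost together.
        volumes = {r["volume"] for r in copies}
        want = copies_wanted.get(h, 2)
        if len(volumes) < want:
            plan[h] = ("copy", want - len(volumes))
        elif len(copies) > want:
            plan[h] = ("winnow", [r["path"] for r in copies[want:]])
        else:
            plan[h] = ("ok", None)
    return plan
```

Counting distinct volumes rather than raw copies reflects the "on separate volumes" requirement above: three copies on one drive are still a single point of failure.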