Ferret File System: Difference between revisions

From Woozle Writes Code

Revision as of 01:11, 25 February 2024

Conceptual Design

Ferret File System (FFS) is actually two ideas which can work together or separately: storage agnosticism and file format agnosticism.

Storage Agnosticism

Traditional filesystems present "drives" or "volumes" that more or less correspond to physical devices, each with its own invariable maximum capacity. This is the metaphor used across all interfaces, including user (GUI, CLI) and application (API).

FFS presents the user with a single storage space that unifies all storage to which the system has access but tracks the availability characteristics of each file. Immediately obvious attributes include:

  • location: What is the physical or metaphorical "place" from which a file is being accessed?
    • Examples: a particular device, a LAN, a building, "the internet"
  • frequency: How often does the file need to be accessed from that location?
    • Examples:
      • archival storage (rarely accessed)
      • social media archive (may vary dynamically)
      • local work (needed a lot by one or more devices, less in demand elsewhere)
      • playing media (immediate read access; write access can be slower or restricted; make local temporary copy)
  • criticality: How important is it to make sure that this file is not lost?
    • This can range from "temporary copy" to "mission-critical".
  • versioning: On what schedule/scheme (if any) should old versions of this file be kept?

Ideally, FFS would determine the attributes of each file based on usage-patterns, but this will be done by a separate agent. There will also be an interface by which to set the initial (expected) availability attributes as well as to override the agent's algorithmic determinations.
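As a rough illustration of how the agent's estimates and the user's overrides might combine, here is a minimal Python sketch. The class name `Availability`, the function `effective`, and the example values are all hypothetical; nothing here is a settled FFS design.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Availability:
    """Hypothetical per-file availability attributes, one field per attribute above."""
    location: str                     # e.g. "laptop", "LAN", "the internet"
    frequency: str                    # e.g. "archival", "local work", "playing media"
    criticality: str                  # e.g. "temporary copy" up to "mission-critical"
    versioning: Optional[str] = None  # retention scheme, if any


def effective(agent_estimate: Availability, **user_overrides) -> Availability:
    """Return the agent's estimate with any user-set fields taking precedence."""
    return replace(agent_estimate, **user_overrides)


# The agent guesses from usage patterns; the user overrides one attribute.
estimate = Availability("laptop", "local work", "temporary copy")
final = effective(estimate, criticality="mission-critical")
```

Keeping the estimate immutable and producing a new object for the overridden view means the agent's own determination is never lost when the user disagrees with it.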

FFS will move files around as usage shifts and as the space available on each storage device changes (because a device is filling up, becoming unreliable, or changing in some other way). It will also keep additional copies of more critical data, on different storage devices.

File Format Agnosticism

Traditional filesystems make certain assumptions about what metadata they need to track for each file – typically: creation date, modification date, file "extension" – which leaves applications to figure out how to present additional metadata they may need to store. This results in a proliferation of "file formats" which need to be understood in order to access that metadata or alter the file's contents without making it unusable.

FFS provides a mechanism for storing arbitrary semantic metadata, and a template system for repackaging that data back into a canonical file-format when needed. This allows applications and end-users to read and modify existing metadata much more easily, as well as to add their own.
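The template idea might look something like the following Python sketch, which uses the standard library's `string.Template`. The metadata keys, the sample values, and the plain-text layout are invented for illustration; a real template system would need to handle binary formats as well.

```python
from string import Template

# Hypothetical semantic metadata for one file, stored apart from its contents.
metadata = {"title": "Field notes", "author": "W. Example", "created": "2024-02-25"}
body = "Saw three ferrets today."

# Hypothetical template describing one canonical plain-text layout.
NOTE_TEMPLATE = Template("Title: $title\nAuthor: $author\nDate: $created\n\n$body\n")


def repackage(template: Template, metadata: dict, body: str) -> str:
    """Merge stored metadata and file contents back into a single file image."""
    return template.substitute(metadata, body=body)


packaged = repackage(NOTE_TEMPLATE, metadata, body)
```

Because the metadata lives in an ordinary mapping, an application can read or change a field without parsing the packaged format at all.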

writing in progress

as an Application

Alternatively, these concepts can be implemented as a userland/desktop application rather than a core filesystem feature. This should make it possible to gain the benefits of these services without investing as much development time, allowing them to be tested and refined before being attempted as a kernel-level service.

Description

  • There is a database of all files in one or more folders (could be the entire filesystem, or just selected folders). I'll call these "source folders".
  • Each record stores all the info found in the directory for its file, along with a hash of the contents.
  • You also give FileFerret one or more folders that it can control, i.e. store stuff however it wants -- doesn't have to be human-friendly. (Ideally, it would have areas on multiple drives -- one for every drive that you're using to archive stuff.) I'll call these "archive spaces".
  • You can mark each source folder as to whether it should be left alone, or should be "managed" -- which means, e.g., we want to remove unnecessary identical copies, especially of larger files, and just keep a record that the file was there (maybe replace it with a link, maybe don't bother).
  • You can also tag files/folders (either individually or by various criteria) as being of various levels of criticality -- basically, how many backup copies do we want?
  • FF then looks for redundant files and removes them. If we want more copies than currently exist, it will make a copy in one or more of its archive spaces.
  • Once we have everything indexed and deduplicated, we can start using FF to manage new content -- instead of trying to organize it just with folders and naming, we can tell FF to just manage a given set of new files. It'll check for duplication, make sure that the requisite number of copies are made, and put them wherever there's room.

Process

Layer 1: The FFS app scans a user-specified set of folders, recording in a database every file found and logging any changes noted. Each file-record would include:

  • full path/name
  • timestamps
  • file size
  • file ownership, system attributes, etc.
  • contents fingerprint (hash, presumed to be unique)
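A file-record along these lines could be gathered with a sketch like the one below. The `FileRecord` fields, the choice of SHA-256 as the fingerprint, and the decision to skip symbolic links are assumptions, not part of the design above.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    path: str      # full path/name
    mtime: float   # last-modified timestamp
    ctime: float   # creation or metadata-change timestamp (platform-dependent)
    size: int      # file size in bytes
    uid: int       # owner id (0 on platforms without file ownership)
    mode: int      # permission bits and file type
    sha256: str    # contents fingerprint


def fingerprint(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(folder):
    """Yield one FileRecord per regular file under folder (symlinks are skipped)."""
    for root, _dirs, names in os.walk(folder):
        for name in names:
            path = os.path.join(root, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            info = os.stat(path)
            yield FileRecord(path, info.st_mtime, info.st_ctime, info.st_size,
                             info.st_uid, info.st_mode, fingerprint(path))
```

Calling `list(scan("/some/source/folder"))` would produce the records for one source folder; writing them to the database is left out of this sketch.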

During each scan, file entries would be checked against the DB by both hash and filespec, to identify changed contents or renaming. Every file found would be logged, along with any changes noted (i.e. anytime there is a partial match with an existing file, the current file will be logged as a modification of that file).
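The matching step could be sketched as follows, with the database simplified to a dict from path to hash. The function name `classify` and the result labels are hypothetical. Note that a hash match at a different path cannot by itself distinguish a rename from a copy; that would need a further check of whether the old path still exists.

```python
def classify(db, path, digest):
    """Compare one scanned file against known entries.

    db maps each known path to the hash recorded for it.
    Returns (label, matched_path).
    """
    known = db.get(path)
    if known == digest:
        return ("unchanged", path)            # same filespec, same contents
    if known is not None:
        return ("modified", path)             # same filespec, new contents
    for old_path, old_digest in db.items():
        if old_digest == digest:
            return ("renamed-or-copied", old_path)  # same contents, new filespec
    return ("new", None)                      # no match on either key


db = {"/docs/a.txt": "h1", "/docs/b.txt": "h2"}
results = [
    classify(db, "/docs/a.txt", "h1"),
    classify(db, "/docs/b.txt", "h9"),
    classify(db, "/docs/c.txt", "h1"),
    classify(db, "/docs/d.txt", "h7"),
]
```

The linear search over `db.items()` is only for brevity; a real index would look hashes up directly.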

There would also be semantic data tables, to allow users to enter arbitrary information about each file as well as a set of criticality settings (e.g. how many copies should there be, to minimize the chance of losing it in a hardware failure).
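One possible shape for these tables, sketched with Python's built-in `sqlite3` module. The table and column names (`file`, `file_note`, `criticality`, `min_copies`) are made up for this example, and the sample row is illustrative only.

```python
import sqlite3

SCHEMA = """
CREATE TABLE file (
    id     INTEGER PRIMARY KEY,
    path   TEXT NOT NULL UNIQUE,
    sha256 TEXT NOT NULL
);
-- Arbitrary user-entered information, one row per (file, attribute).
CREATE TABLE file_note (
    file_id INTEGER NOT NULL REFERENCES file(id),
    attr    TEXT NOT NULL,
    val     TEXT NOT NULL,
    PRIMARY KEY (file_id, attr)
);
-- Criticality setting: how many copies should exist.
CREATE TABLE criticality (
    file_id    INTEGER PRIMARY KEY REFERENCES file(id),
    min_copies INTEGER NOT NULL CHECK (min_copies >= 0)
);
"""


def open_db(location=":memory:"):
    """Create the sketch schema in a new database and return the connection."""
    conn = sqlite3.connect(location)
    conn.executescript(SCHEMA)
    return conn


conn = open_db()
conn.execute("INSERT INTO file (id, path, sha256) VALUES (1, '/photos/cat.jpg', 'abc123')")
conn.execute("INSERT INTO file_note (file_id, attr, val) VALUES (1, 'subject', 'cat')")
conn.execute("INSERT INTO criticality (file_id, min_copies) VALUES (1, 3)")
row = conn.execute(
    "SELECT f.path, c.min_copies FROM file f JOIN criticality c ON c.file_id = f.id"
).fetchone()
```

Storing notes as attribute/value rows keeps the schema open-ended, at the cost of leaving validation of individual attributes to the application.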

Layer 2: FFS checks for duplicate files, and ensures that duplication for each file is within the limits described by the user -- that is, files with an insufficient number of backups will have additional copies made (on separate volumes), and over-duplicated files will be winnowed (either replaced with links or deleted, depending on various conditions).
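The Layer 2 check could begin as a planning pass that only lists what should happen, as in this sketch. It assumes that only copies on distinct volumes count toward the wanted number, and it leaves the "various conditions" for choosing between linking and deleting undecided; the function name `plan_replication` and the action tuples are hypothetical.

```python
from collections import defaultdict


def plan_replication(copies, wanted, default_wanted=1):
    """Plan copy/winnow actions.

    copies: iterable of (hash, volume, path) for every indexed file.
    wanted: dict mapping hash -> number of copies the user asked for.
    Returns a list of ("winnow", hash, volume, path) and ("copy", hash, count).
    """
    by_hash = defaultdict(list)
    for digest, volume, path in copies:
        by_hash[digest].append((volume, path))

    actions = []
    for digest in sorted(by_hash):
        target = wanted.get(digest, default_wanted)
        kept_volumes = set()
        for volume, path in sorted(by_hash[digest]):
            if volume not in kept_volumes and len(kept_volumes) < target:
                kept_volumes.add(volume)  # this copy counts toward the target
            else:
                actions.append(("winnow", digest, volume, path))
        missing = target - len(kept_volumes)
        if missing > 0:
            # That many additional copies are needed, each on a new volume.
            actions.append(("copy", digest, missing))
    return actions


copies = [
    ("h1", "A", "/a/x"), ("h1", "A", "/a/y"), ("h1", "B", "/b/x"),
    ("h2", "A", "/a/z"),
]
actions = plan_replication(copies, {"h1": 2, "h2": 3})
```

Producing a plan rather than acting immediately leaves room for the user to review deletions, and for folders marked "left alone" to be filtered out before anything is touched.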