Commit 7b87248

Merge pull request #226 from ipfs/unixfs-1.5-metadata-in-files
docs: adds spec sections for mode and mtime metadata
2 parents 51577eb + c4899dc commit 7b87248

1 file changed

+112
-6
lines changed

UNIXFS.md

Lines changed: 112 additions & 6 deletions
@@ -1,4 +1,4 @@
1-
# ![](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square) UnixFS
1+
# ![](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square) UnixFS <!-- omit in toc -->
22

33
**Author(s)**:
44
- NA
@@ -11,15 +11,30 @@ UnixFS is a [protocol-buffers](https://developers.google.com/protocol-buffers/)
1111

1212
Draft work and discussion on a specification for the upcoming version 2 of the UnixFS format is happening in the [`ipfs/unixfs-v2` repo](https://github.com/ipfs/unixfs-v2). Please see the issues there for discussion and PRs for drafts. When the specification is completed there, it will be copied back to this repo and replace this document.
1313

14-
## Table of Contents
15-
16-
TODO
14+
## Table of Contents <!-- omit in toc -->
15+
16+
- [Implementations](#implementations)
17+
- [Data Format](#data-format)
18+
- [Metadata](#metadata)
19+
- [Deduplication and inlining](#deduplication-and-inlining)
20+
- [Importing](#importing)
21+
- [Chunking](#chunking)
22+
- [Layout](#layout)
23+
- [Exporting](#exporting)
24+
- [Design decision rationale](#design-decision-rationale)
25+
- [Metadata](#metadata-1)
26+
- [Separate Metadata node](#separate-metadata-node)
27+
- [Metadata in the directory](#metadata-in-the-directory)
28+
- [Metadata in the file](#metadata-in-the-file)
29+
- [Side trees](#side-trees)
30+
- [Side database](#side-database)
1731

1832
## Implementations
1933

2034
- JavaScript
2135
- Data Formats - [unixfs](https://github.com/ipfs/js-ipfs-unixfs)
22-
- Importers and Exporters - [unixfs-engine](https://github.com/ipfs/js-ipfs-unixfs-engine)
36+
- Importer - [unixfs-importer](https://github.com/ipfs/js-ipfs-unixfs-importer)
37+
- Exporter - [unixfs-exporter](https://github.com/ipfs/js-ipfs-unixfs-exporter)
2338
- Go
2439
- [`ipfs/go-ipfs/unixfs`](https://github.com/ipfs/go-ipfs/tree/b3faaad1310bcc32dc3dd24e1919e9edf51edba8/unixfs)
2540
- Protocol Buffer Definitions - [`ipfs/go-ipfs/unixfs/pb`](https://github.com/ipfs/go-ipfs/blob/b3faaad1310bcc32dc3dd24e1919e9edf51edba8/unixfs/pb/unixfs.proto)
@@ -43,9 +58,10 @@ message Data {
4358
optional bytes Data = 2;
4459
optional uint64 filesize = 3;
4560
repeated uint64 blocksizes = 4;
46-
4761
optional uint64 hashType = 5;
4862
optional uint64 fanout = 6;
63+
optional uint32 mode = 7;
64+
optional int64 mtime = 8;
4965
}
5066
5167
message Metadata {
@@ -59,6 +75,27 @@ For files that are comprised of more than a single block, the 'Type' field will
5975

6076
This data is serialized and placed inside the 'Data' field of the outer merkledag protobuf, which also contains the actual links to the child nodes of this object.
6177

78+
For files comprised of a single block, the 'Type' field will be set to 'File', 'filesize' will be set to the total number of bytes in the file and the file data will be stored in the 'Data' field.
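
As a rough illustration of the single-block case, the sketch below hand-encodes a `Data` message for a small file. The field numbers for `Data` (2) and `filesize` (3) come from the message definition above; treating `Type` as field 1 with `File = 2` is an assumption that this excerpt does not show and should be checked against the full `unixfs.proto`. The helper names are hypothetical, and a real implementation would normally use a protobuf library instead.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def varint_field(number: int, value: int) -> bytes:
    """Encode a field with wire type 0 (varint)."""
    return encode_varint((number << 3) | 0) + encode_varint(value)


def bytes_field(number: int, payload: bytes) -> bytes:
    """Encode a field with wire type 2 (length-delimited)."""
    return encode_varint((number << 3) | 2) + encode_varint(len(payload)) + payload


FILE_TYPE = 2  # assumption: DataType.File, not shown in this excerpt


def encode_single_block_file(content: bytes) -> bytes:
    """Build a UnixFS Data message for a file that fits in one block."""
    return (
        varint_field(1, FILE_TYPE)        # Type (assumed field 1)
        + bytes_field(2, content)         # Data = 2
        + varint_field(3, len(content))   # filesize = 3
    )
```

The resulting bytes are what would be placed in the `Data` field of the outer merkledag protobuf.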
79+
80+
## Metadata
81+
82+
UnixFS currently supports two optional metadata fields:
83+
84+
* `mode` -- The `mode` is for persisting the file permissions in [numeric notation](https://en.wikipedia.org/wiki/File_system_permissions#Numeric_notation) \[[spec](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html)\].
85+
If unspecified, this defaults to `0755` for directories/HAMT shards and `0644` for all other types where applicable.
86+
The nine least significant bits represent `ugo-rwx`.
87+
The next three least significant bits represent `setuid`, `setgid` and the `sticky bit`.
88+
All others are reserved for future use.
89+
* `mtime` -- The modification time in seconds since the epoch. This defaults to the Unix epoch if unspecified.
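
The defaults and bit layout described above can be sketched as follows. This is a minimal illustration, not part of the spec: the function names and the use of `None` for "field absent" are assumptions made for the example.

```python
from typing import Optional

DEFAULT_DIR_MODE = 0o755   # directories and HAMT shards
DEFAULT_FILE_MODE = 0o644  # all other types, where applicable
DEFAULT_MTIME = 0          # the Unix epoch, in seconds


def effective_mode(mode: Optional[int], is_directory: bool) -> int:
    """Return the stored mode, or the default when the field is absent."""
    if mode is None:
        return DEFAULT_DIR_MODE if is_directory else DEFAULT_FILE_MODE
    return mode


def effective_mtime(mtime: Optional[int]) -> int:
    """Return the stored mtime in seconds, or the epoch when absent."""
    return DEFAULT_MTIME if mtime is None else mtime


def split_mode(mode: int) -> dict:
    """Split a mode value into the bit groups described in the list above."""
    return {
        "permissions": mode & 0o777,    # nine least significant bits: ugo-rwx
        "setuid": bool(mode & 0o4000),  # next three bits
        "setgid": bool(mode & 0o2000),
        "sticky": bool(mode & 0o1000),
        "reserved": mode & ~0o7777,     # everything else, reserved for future use
    }
```

For instance, `split_mode(0o4755)` reports permissions `0o755` with `setuid` set and no reserved bits in use.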
90+
91+
### Deduplication and inlining
92+
93+
Where the file data is small it would normally be stored in the `Data` field of the UnixFS `File` node.
94+
95+
To aid in deduplication of data even for small files, file data can be stored in a separate node linked to from the `File` node in order for the data to have a constant [CID] regardless of the metadata associated with it.
96+
97+
As a further optimization, if the `File` node's serialized size is small, it may be inlined into its v1 [CID] by using the [`identity`](https://github.com/multiformats/multicodec/blob/master/table.csv) [multihash].
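
A minimal sketch of the inlining step, assuming the node is a dag-pb block (multicodec `0x70`) and using the `identity` multihash code `0x00`. The size threshold `MAX_INLINE_SIZE` is hypothetical; this document does not define what counts as "small". The function names are likewise illustrative.

```python
import base64
from typing import Optional

CID_VERSION = 1
DAG_PB_CODEC = 0x70        # multicodec code for dag-pb
IDENTITY_MULTIHASH = 0x00  # multihash code for identity
MAX_INLINE_SIZE = 64       # hypothetical threshold, not taken from the spec


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def inline_cid(node_bytes: bytes) -> Optional[bytes]:
    """Return a binary v1 CID that embeds the node, or None if it is too large."""
    if len(node_bytes) > MAX_INLINE_SIZE:
        return None
    # An identity multihash carries the data itself in place of a digest.
    multihash = (
        encode_varint(IDENTITY_MULTIHASH)
        + encode_varint(len(node_bytes))
        + node_bytes
    )
    return encode_varint(CID_VERSION) + encode_varint(DAG_PB_CODEC) + multihash


def to_base32(cid_bytes: bytes) -> str:
    """Render a binary CID as multibase base32 (lowercase, unpadded, 'b' prefix)."""
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
```

Because the node travels inside the CID, resolving it requires no extra fetch, which is what lets file data live in a separate leaf without adding a round trip.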
98+
6299
## Importing
63100

64101
Importing a file into UnixFS is split into two parts: chunking first, then layout.
@@ -86,3 +123,72 @@ If there is only a single chunk, no intermediate unixfs file nodes are created,
86123
## Exporting
87124

88125
To read the file data out of the UnixFS graph, perform an in-order traversal, emitting the data contained in each of the leaves.
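
The traversal can be sketched with a toy in-memory tree. The `Node` class below is a hypothetical stand-in for a decoded UnixFS node whose links have already been resolved; a real exporter would fetch each linked block by its CID.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Toy stand-in for a decoded UnixFS file node (hypothetical type)."""
    data: bytes = b""
    links: List["Node"] = field(default_factory=list)


def export(node: Node) -> Iterator[bytes]:
    """Yield leaf data in link order, which reproduces the original byte stream."""
    if not node.links:
        yield node.data
        return
    for child in node.links:
        yield from export(child)


# A two-level tree whose leaves spell out the file contents.
tree = Node(links=[
    Node(links=[Node(data=b"hello"), Node(data=b" ")]),
    Node(data=b"world"),
])
content = b"".join(export(tree))
```

Using a generator keeps memory use proportional to one leaf at a time rather than to the whole file.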
126+
127+
## Design decision rationale
128+
129+
### Metadata
130+
131+
Metadata support in UnixFSv1.5 has been expanded to increase the number of possible use cases. These include rsync and filesystem-based package managers.
132+
133+
Several metadata systems were evaluated:
134+
135+
#### Separate Metadata node
136+
137+
In this scheme, the existing `Metadata` message is expanded to include additional metadata types (`mtime`, `mode`, etc). It then contains links to the actual file data but never the file data itself.
138+
139+
This was ultimately rejected for a number of reasons:
140+
141+
1. You would always need to retrieve an additional node to access file data, which limits the kinds of optimizations that are possible.
142+
143+
For example, many files are under the 256KiB block size limit, so we tend to inline them into the describing UnixFS `File` node. This would not be possible with an intermediate `Metadata` node.
144+
145+
2. The `File` node already contains some metadata (e.g. the file size), so metadata would be stored in multiple places. This complicates forwards compatibility with UnixFSv2, as mapping between metadata formats potentially requires multiple fetch operations.
146+
147+
#### Metadata in the directory
148+
149+
Repeated `Metadata` messages are added to UnixFS `Directory` and `HAMTShard` nodes, the index of which indicates which entry they are to be applied to.
150+
151+
Where entries are `HAMTShard`s, an empty message is added.
152+
153+
One advantage of this method is that if we expand stored metadata to include entry types and sizes, we can perform directory listings without needing to fetch further entry nodes (excepting `HAMTShard` nodes). However, unless we remove the storage of this data elsewhere in the spec, we run the risk of having non-canonical data locations and perhaps conflicting data as we traverse trees containing both UnixFS v1 and v1.5 nodes.
154+
155+
This was rejected for the following reasons:
156+
157+
1. When creating a UnixFS node there's no way to record metadata without wrapping it in a directory.
158+
159+
2. If you access any UnixFS node directly by its [CID], there is no way of recreating the metadata, which limits flexibility.
160+
161+
3. In order to list the contents of a directory including entry types and sizes, you have to fetch the root node of each entry anyway, so the performance benefit of including some metadata in the containing directory is negligible in this use case.
162+
163+
#### Metadata in the file
164+
165+
This adds new fields to the UnixFS `Data` message to represent the various metadata fields.
166+
167+
It has the advantage of being simple to implement, and metadata is maintained whether the file is accessed directly via its [CID] or via an IPFS path that includes a containing directory. By keeping the metadata small enough, we can also inline root UnixFS nodes into their [CID]s, so we end up fetching the same number of nodes even if we decide to keep file data in a leaf node for deduplication reasons.
168+
169+
Downsides to this approach are:
170+
171+
1. Two users adding the same file to IPFS at different times will have different [CID]s due to the `mtime`s being different.
172+
173+
If the content is stored in another node, its [CID] will be constant between the two users, but you can't navigate to it unless you have the parent node, which will be less available due to the proliferation of [CID]s.
174+
175+
2. Metadata is also impossible to remove without changing the [CID], so metadata becomes part of the content.
176+
177+
3. Performance may also be impacted: if we don't inline UnixFS root nodes into [CID]s, additional fetches will be required to load a given UnixFS entry.
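
The first downside can be illustrated with a toy model. The serialization below is deliberately not dag-pb, and a SHA-256 hex digest stands in for a real [CID]; both are simplifications made only to show that the root identifier changes with `mtime` while a separately stored content leaf keeps the same identifier.

```python
import hashlib
from typing import Tuple


def digest(block: bytes) -> str:
    """Stand-in for a CID: the SHA-256 hex digest of the block bytes."""
    return hashlib.sha256(block).hexdigest()


def toy_ids(content: bytes, mtime: int) -> Tuple[str, str]:
    """Return (root_id, leaf_id) for a file whose data lives in a separate leaf.

    The root block here is a toy encoding, not the real UnixFS wire format.
    """
    leaf_id = digest(content)
    root_block = b"file|" + leaf_id.encode("ascii") + b"|mtime=" + str(mtime).encode("ascii")
    return digest(root_block), leaf_id


alice_root, alice_leaf = toy_ids(b"same bytes", 1_600_000_000)
bob_root, bob_leaf = toy_ids(b"same bytes", 1_700_000_000)
```

Here `alice_leaf` equals `bob_leaf`, so the content deduplicates, but `alice_root` differs from `bob_root`, so each user advertises a different entry point to it.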
178+
179+
#### Side trees
180+
181+
With this approach we would maintain a separate data structure outside of the UnixFS tree to hold metadata.
182+
183+
This was rejected due to concerns about added complexity, recovery after system crashes while writing, and having to make extra requests to fetch metadata nodes when resolving [CID]s from peers.
184+
185+
#### Side database
186+
187+
This scheme would see metadata stored in an external database.
188+
189+
The downsides to this are that metadata would not be transferred from one node to another when syncing as [Bitswap] is not aware of the database, and in-tree metadata
190+
191+
[multihash]: https://tools.ietf.org/html/draft-multiformats-multihash-00
192+
[CID]: https://docs.ipfs.io/guides/concepts/cid/
193+
[Bitswap]: https://github.com/ipfs/specs/blob/master/BITSWAP.md
194+
[MFS]: https://docs.ipfs.io/guides/concepts/mfs/
