Files, folders and versioning
Folder structure
The folder structure gives an overall picture of where data and other components are located. It should correspond to the project design and workflow. For example, the structure should separate ongoing work from completed work, and raw data from analyzed data.
It should be easy for present and future colleagues to find documentation that describes the agreed folder structure and the conventions for file naming and versioning. A simple solution is to put a ReadMe file in plain-text (.txt) format in the top folder of the structure.
Each folder should have a unique, self-explanatory name that is only as long as necessary. Avoid giving a folder and its subfolder the same name.
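The points above can be sketched as a small script that creates a folder skeleton with a ReadMe file in the top folder. This is only an illustration: the folder names and the ReadMe text are hypothetical, are not a recommended structure from any institution, and should be adapted to the project's own design and workflow.

```python
from pathlib import Path

# Hypothetical folder names, chosen only to illustrate separating raw data
# from analyzed data and ongoing work from completed work.
FOLDERS = [
    "01_raw_data",
    "02_analyzed_data",
    "03_ongoing_work",
    "04_completed_work",
    "05_documentation",
]

# Hypothetical ReadMe content; replace with the project's own conventions.
README_TEXT = (
    "This ReadMe describes the folder structure of the project and the\n"
    "conventions used for file naming and versioning.\n"
)


def create_project_skeleton(root: Path) -> list[Path]:
    """Create the folders and a top-level ReadMe.txt; return the folders."""
    root.mkdir(parents=True, exist_ok=True)
    created = []
    for name in FOLDERS:
        folder = root / name
        folder.mkdir(exist_ok=True)
        created.append(folder)
    readme = root / "ReadMe.txt"
    if not readme.exists():  # never overwrite existing documentation
        readme.write_text(README_TEXT, encoding="utf-8")
    return created
```

Calling `create_project_skeleton(Path("MyProject"))` would create the folders under `MyProject` and leave any existing ReadMe file untouched.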
Uppsala University has a recommended directory structure for projects that is adapted to the requirements of archiving that are imposed on universities as an authority.
Naming files
Choose a policy for naming files early in a project. A file name should:
- be transparent, describing the most important aspects of the content
- show how a specific file relates to other files
- be unique and follow a consistent pattern
- inform about content, status and version
Use underscores (_) to separate elements in file names, and hyphens (-) or uppercase letters to increase readability within an element. Avoid spaces, special characters and accented characters. Use standards such as ISO 8601 for dates, times and time intervals.
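A check like the following could be used to catch file names that break these rules. It is a minimal sketch under an assumed rule set: only ASCII letters, digits, underscores and hyphens are allowed, followed by a single extension. The function name `is_safe_filename` is hypothetical, and a project may well choose a looser or stricter rule.

```python
import re

# Assumed rule: ASCII letters, digits, underscores and hyphens only,
# followed by exactly one extension (so "data.tar.gz" is rejected).
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")


def is_safe_filename(name: str) -> bool:
    """Return True if the name contains no spaces, special or accented characters."""
    return SAFE_NAME.fullmatch(name) is not None
```

For example, `is_safe_filename("ProjX_Exp01_2024-03-15_v01.csv")` returns `True`, while names containing a space or an accented character, such as `"my file.csv"` or `"mätning.csv"`, return `False`.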
Be consistent in your use of upper- and lowercase, and use a fixed number of digits, padded with leading zeros, for files that need to be listed in numerical order, for example 0001, 0002. This makes sorting easier and improves machine-readability.
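The effect of a fixed number of digits on sorting can be shown with a short example. The prefix `scan` and the extension `tif` are placeholders, and the helper `numbered_name` is hypothetical.

```python
def numbered_name(prefix: str, number: int, width: int = 4, ext: str = "tif") -> str:
    """Build a file name whose number is zero-padded to a fixed width."""
    return f"{prefix}_{number:0{width}d}.{ext}"


# Without padding, plain alphabetical sorting puts 10 before 2.
unpadded = sorted(f"scan_{n}.tif" for n in (1, 2, 10))
# ['scan_1.tif', 'scan_10.tif', 'scan_2.tif']

# With padding, alphabetical order and numerical order agree.
padded = sorted(numbered_name("scan", n) for n in (1, 2, 10))
# ['scan_0001.tif', 'scan_0002.tif', 'scan_0010.tif']
```

The width should be chosen so that it covers the largest number of files the project expects to produce.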
Consider your needs to sort files when selecting components in file names. Often it is better to go from general to more specific, e.g., ProjectAbbr_ExperimentNr_Location_Time_TypeOfData_VersionNr
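A pattern like the one above could be assembled programmatically so that every file follows it consistently. The sketch below assumes a particular form for each element (an `Exp` prefix with two digits, an ISO 8601 date, a `v` prefix with two digits); the function name, parameters and example values are all hypothetical.

```python
from datetime import date


def build_filename(
    project: str,
    experiment_nr: int,
    location: str,
    day: date,
    data_type: str,
    version: int,
    extension: str,
) -> str:
    """Join the elements from general to specific, separated by underscores."""
    parts = [
        project,
        f"Exp{experiment_nr:02d}",  # fixed number of digits for sorting
        location,
        day.isoformat(),            # ISO 8601 date, e.g. 2024-03-15
        data_type,
        f"v{version:02d}",
    ]
    return "_".join(parts) + "." + extension


name = build_filename("ProjX", 3, "Uppsala", date(2024, 3, 15), "Temp", 1, "csv")
# name == "ProjX_Exp03_Uppsala_2024-03-15_Temp_v01.csv"
```

Because the most general elements come first, files from the same project and experiment end up next to each other when a folder is sorted by name.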
File formats
As a researcher, you should choose the file formats that are best suited to the selected type of data collection and analytical method. However, an effort should be made to use file formats based on open and well-documented standards. Ideally, the formats should be accessible and readable in the long term, supplier-independent and non-proprietary. This makes it easier to share, reuse and preserve data. If necessary, files in their original formats may need to be converted to formats suitable for long-term storage and archiving.
Due to practices in certain disciplines and the need to use specific instruments and analysis tools, it is sometimes necessary to use proprietary vendor-dependent file formats.
Keep the following in mind when choosing file formats:
- Are there any area-specific recommendations?
- Is the software compatible with the systems provided by the University?
- How should data be generated and analyzed?
- Can you add metadata?
- Is the format suitable for sharing data?
- Is the format appropriate regarding long-term accessibility and readability?
- Does it work in all parts of the process, with as little need for conversion to other formats as possible?
To improve reusability, describe and document the file formats and software used.
Versioning
In projects where data, files and other digital components appear in different versions, it is important that you and your colleagues can tell the versions apart, and that it is clear what each version contains and how it differs from the others. Rules for versioning should be documented and included in a data management plan. In larger projects with many members, it may be appropriate to give one colleague the responsibility for ensuring that naming and versioning guidelines are followed and updated.
Each newly saved version of the data should be given a new version number (e.g. v01, v02, v03) and, when needed, the date on which the file was created. Major changes to a file can be indicated with integers, for example v01 for the first version and v02 for the second. Minor changes can be indicated by adding a further element to the file name, for example v01_01, v01_02 and so on.
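The numbering scheme can be sketched in code. The example below assumes one possible reading of the scheme: two digits for the major version, an optional two-digit minor part after an underscore, and a major change resetting the minor part. The function names are hypothetical.

```python
import re

# Assumed format: "v" + two digits, optionally "_" + two digits (e.g. v01, v01_02).
VERSION = re.compile(r"v(\d{2})(?:_(\d{2}))?")


def parse_version(tag: str) -> tuple[int, int]:
    """Split a tag such as 'v01_02' into (major, minor); minor defaults to 0."""
    match = VERSION.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a valid version tag: {tag!r}")
    major = int(match.group(1))
    minor = int(match.group(2)) if match.group(2) else 0
    return major, minor


def format_version(major: int, minor: int = 0) -> str:
    """Format (major, minor) back into a tag, omitting a zero minor part."""
    if minor == 0:
        return f"v{major:02d}"
    return f"v{major:02d}_{minor:02d}"


def next_version(tag: str, major_change: bool) -> str:
    """Return the tag that follows `tag` after a major or a minor change."""
    major, minor = parse_version(tag)
    if major_change:
        return format_version(major + 1)  # minor part starts over
    return format_version(major, minor + 1)
```

With this sketch, `next_version("v01", major_change=False)` gives `"v01_01"`, and `next_version("v01_02", major_change=True)` gives `"v02"`. Two digits limit each part to 99; a project expecting more versions would need a wider field.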
Important changes can be documented in a version control table. It allows you to specify what was changed, why and when, creating better traceability for data and results. Also remember to document how different versions of data relate to other components such as code, analysis method and workflow.
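One simple way to keep such a table is a CSV file that gets one new row per version. The sketch below is an assumption about how the table might look: the column names, the file name `version_log.csv` and the function `log_version` are hypothetical, and a spreadsheet or a table in the ReadMe file would serve the same purpose.

```python
import csv
from datetime import date
from pathlib import Path
from typing import Optional

# Hypothetical columns covering what was changed, why and when.
FIELDS = ["file", "version", "date", "what_changed", "why"]


def log_version(
    table: Path,
    file: str,
    version: str,
    what_changed: str,
    why: str,
    when: Optional[date] = None,
) -> None:
    """Append one row to the version control table, adding a header if the file is new."""
    is_new = not table.exists()
    with table.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(
            {
                "file": file,
                "version": version,
                "date": (when or date.today()).isoformat(),  # ISO 8601
                "what_changed": what_changed,
                "why": why,
            }
        )
```

A call such as `log_version(Path("version_log.csv"), "ProjX_Temp_v02.csv", "v02", "Removed outliers", "Sensor fault")` appends a dated row, so the history of each file can be read from top to bottom.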
To manage versions of software code, any type of system based on Git is appropriate.
See also: Folder structure, file names, and versioning - Swedish National Data Service