I'm running a small passive acoustic monitoring project for terrestrial species, using AudioMoths and Swifts. How do people and organizations manage the ballooning datasets of .wav/.flac files? Should I be retaining lossless versions of all my recordings for posterity?
I currently have two 10 TB external hard drives (ext HDDs). My workstation's internal storage is small (250 GB). Right now I copy the .wav files from the memory cards to ext HDD #1, compress them to .flac as a backup on ext HDD #2, and run analyses directly from ext HDD #1. I am concerned that I will eventually wreck ext HDD #1 with lots of reads and writes. I am already approaching the point where I need a third ext HDD to hold more data, but my organization discourages ext HDD use for several understandable reasons. I would like to keep all the data I collect so other researchers can use it in the future, and to host any derived detection data on a public repository. Is this what other people are doing?
15 September 2022 4:22am
I definitely feel you on the mounds of external hard drives; I have a stack of them sitting on my desk (something like 5 TB in total). I don't really have a good answer. I've started experimenting with cloud storage (Azure Blob Storage, AWS S3 buckets), but this can be expensive depending on how much data you have, how long you need it stored, what storage tier you choose, etc. I've been using Azure and then pulling data from storage onto VMs to run code.
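If you do go the S3 route, one way to tame the cost is a lifecycle rule that shifts older raw recordings into cheaper archival tiers while keeping recent deployments in standard storage for analysis. A sketch of such a configuration (the bucket prefix and day thresholds are made up; adjust to your retrieval needs, since Glacier-class retrievals are slow and billed separately):

```json
{
  "Rules": [
    {
      "ID": "archive-pam-audio",
      "Filter": { "Prefix": "raw-audio/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
```

Azure has an analogous blob lifecycle-management feature (Hot/Cool/Archive tiers), so the same idea carries over.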
In terms of archiving, LILA BC is great for annotated ("ML-ready") datasets. The folks at Dataverse are usually pretty accommodating, but multiple terabytes might be a step too far (I believe they'll host up to 1TB for free).
It doesn't help that there really aren't any metadata/compression standards for PAM data. Related to that, there is a great paper looking at the effects of different compression schemes on acoustic signals.
Interested to hear what others think!