
Best practices for Passive Acoustic Monitoring data management?

Hello!

I'm running a small passive acoustic monitoring project for terrestrial species, using AudioMoth and Swift recorders. How do people and organizations manage the ballooning datasets of .wav/.flac files? Should I retain lossless versions of all my recordings for posterity?

I currently have two 10 TB external hard drives. My workstation's internal storage is small (250 GB). Right now I copy .wav files from the memory cards to external HDD #1, back them up as .flac to external HDD #2, and run analyses directly from HDD #1. I'm concerned that I'll eventually wreck HDD #1 with heavy reads and writes. I'm also approaching the point where I need a third drive, but my organization discourages external HDD use for several understandable reasons. I'd like to keep all the data I collect so other researchers can use it in the future, and to host any derived detection data on a public repository. Is this what other people are doing?
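
For concreteness, my copy/backup step is essentially the following (a rough Python sketch; the paths and filename pattern are placeholders, and the FLAC step assumes ffmpeg is installed):

```python
# Rough sketch of my ingest step: copy WAVs from the memory card to the
# working drive, then transcode a lossless FLAC backup to the second drive.
# Paths are placeholders; assumes ffmpeg is on the PATH and the target
# directories already exist.
import shutil
import subprocess
from pathlib import Path

CARD = Path("/media/sdcard")        # memory card mount point
PRIMARY = Path("/mnt/exthdd1/wav")  # working copies (.wav)
BACKUP = Path("/mnt/exthdd2/flac")  # lossless backups (.flac)

for wav in sorted(CARD.glob("*.WAV")):
    dest = PRIMARY / wav.name
    shutil.copy2(wav, dest)  # copy, preserving timestamps
    flac = BACKUP / (wav.stem + ".flac")
    # Transcode losslessly; FLAC typically roughly halves the file size.
    # "-n" refuses to overwrite an existing backup.
    subprocess.run(
        ["ffmpeg", "-n", "-i", str(dest), "-c:a", "flac", str(flac)],
        check=True,
    )
```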

Thanks!




I definitely feel you on the mounds of external hard drives; I have a stack of them sitting on my desk (about 5 TB in total). I don't really have a good answer. I've started experimenting with cloud storage (Azure Blob Storage, AWS S3 buckets), but this can get expensive depending on how much data you have, how long you need it stored, what storage tier you use, etc. I've been using Azure, pulling data from storage to VMs to run code.
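
For example, pulling a single recording down to a VM looks roughly like this (a minimal sketch assuming the azure-storage-blob package; the connection string, container, and blob names are made up):

```python
# Minimal sketch: download one recording from Azure Blob Storage to local
# disk on a VM. Assumes `pip install azure-storage-blob`; names are
# placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="pam-recordings",
                               blob="site1/rec001.flac")

with open("rec001.flac", "wb") as f:
    f.write(blob.download_blob().readall())  # stream blob contents to file
```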

In terms of archiving, LILA BC is great for annotated ("ML-ready") datasets. The folks at Dataverse are usually pretty accommodating, but multiple terabytes might be a step too far (I believe they'll host up to 1 TB for free).

It doesn't help that there really aren't any metadata or compression standards for PAM data. Relatedly, there's a great paper looking at the effects of different compression schemes on acoustic signals.

Interested to hear what others think!

Hi Alex,

The first thing I'd suggest thinking through is how much data you have versus how much you're actively working on. If you have data from previous years that you want to store securely and reliably, but don't need immediate access to for analysis, that opens up some options. You can compress with a lossless codec like FLAC (the compression ratio varies, but around 50% is typical) and convert back to WAV if you ever need to reanalyze. Lossy formats like MP3 or Ogg Vorbis save even more storage space, but you lose information in ways you wouldn't with FLAC; whether that trade-off is acceptable depends on your specific needs.
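
To illustrate that the FLAC round trip really is lossless, here's a minimal sketch (assuming the soundfile package; file names are placeholders):

```python
# Minimal sketch of a lossless WAV -> FLAC -> WAV round trip.
# Assumes `pip install soundfile numpy`; file names are placeholders.
import numpy as np
import soundfile as sf

data, sr = sf.read("recording.wav", dtype="int16")  # original PCM samples
sf.write("recording.flac", data, sr)                # lossless compression

restored, sr2 = sf.read("recording.flac", dtype="int16")
# FLAC is lossless for integer PCM: samples come back bit-identical.
assert sr == sr2 and np.array_equal(data, restored)
```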

I'd also recommend setting up a RAID array (RAID = "Redundant Array of Inexpensive Disks"), which offers some additional security in the event of a drive failure. A lot of folks who do video editing (probably the most similar use case to acoustic work, among people who lack the institutional support of a large company or university IT department) use a local NAS enclosure designed for just this purpose, like https://www.qnap.com/en-us/product/ts-433. The startup cost is higher than buying individual USB hard drives, but that comes with some perks, including additional reliability, and reads can be faster depending on the exact drive specs and your local networking setup.

There are also low-cost cold-storage cloud services like Amazon's Glacier. However, getting these set up can be a little tricky, and retrieval is slow: if you upload data to Glacier it will be very safe, but getting it back when you need it again can take a few days depending on the retrieval tier and dataset size.
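
For what it's worth, the upload side is simple enough; it's the restore step that surprises people. A hedged sketch, assuming boto3 with configured AWS credentials and a hypothetical bucket name:

```python
# Sketch: archive a file to S3's cheapest Glacier tier, then request a
# restore later. Assumes `pip install boto3` and configured credentials;
# bucket and key names are hypothetical.
import boto3

s3 = boto3.client("s3")

# Upload straight into Glacier Deep Archive (cheapest, slowest tier).
s3.upload_file(
    "recording.flac",
    "my-pam-archive-bucket",
    "site1/2022/recording.flac",
    ExtraArgs={"StorageClass": "DEEP_ARCHIVE"},
)

# Getting it back requires an explicit restore request before download;
# a Bulk restore from Deep Archive can take up to ~48 hours.
s3.restore_object(
    Bucket="my-pam-archive-bucket",
    Key="site1/2022/recording.flac",
    RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}},
)
```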

Hello Alex,

My information might not be that helpful to you, but our organisation has an enterprise AWS licence and we store all our media files (video, pictures, audio, etc.) there. We also use a media management solution, Piction, through which we upload files into the S3 bucket; in the process it also captures the file metadata (some metadata values need to be entered manually). This makes the files searchable if someone wants to view or process them later. We are about to decide on a storage lifecycle configuration so that old files move to cheaper storage like AWS Glacier, from which retrieval can take up to a week.
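
Something like the following lifecycle rule is what we have in mind (a rough boto3 sketch; the bucket name, prefix, and 90-day threshold are placeholders):

```python
# Sketch of an S3 lifecycle rule that moves objects to Glacier after a
# fixed age. Assumes boto3 with configured credentials; bucket name,
# prefix, and the 90-day threshold are placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-media-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-audio",
                "Status": "Enabled",
                "Filter": {"Prefix": "audio/"},  # only apply to this prefix
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```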

Jitendra 

Hi Alex,

I'd go much further along the lines David @dtsavage sets out. Before jumping to implementations, it's better to think through why you want to keep all that data, and for whom. From your question, it appears you have at least three purposes:

1. for yourself, to do your research;

2. for others, to re-use;

3. for yourself, to have a back-up.

For 1), do what works best for you.

For 3), use your organization's back-up system, or whatever comes closest to that.

For 2) and 3): as you indicate yourself, deposit your data at your nation's repository, or at zenodo.org if your nation does not have one. It may take some documentation work (which is what you should be doing anyway, right?), but then you can stop worrying about holding on to it. Someone else is doing that for you, and they do a much better job, because it is their job. Moreover, by putting the data into a repository you increase the chance that others will actually become aware of all that data you're sitting on. Otherwise, who is ever going to find out that you have those disks on your desk? Lastly, depositing your data can also serve as a back-up. If you don't want to share it before you've published, there is likely the option of depositing under a time embargo, or of requiring your consent for any re-use.
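
If it helps, a deposit can even be scripted. A rough sketch against Zenodo's REST deposit API, assuming the requests package; the access token and file name are placeholders:

```python
# Rough sketch of a scripted Zenodo deposit via its REST API.
# Assumes `pip install requests`; the token and file name are placeholders.
import requests

TOKEN = "<your-zenodo-access-token>"
API = "https://zenodo.org/api/deposit/depositions"

# Create an empty deposition.
r = requests.post(API, params={"access_token": TOKEN}, json={})
r.raise_for_status()
deposition = r.json()

# Upload a file into the deposition's bucket.
bucket = deposition["links"]["bucket"]
with open("detections.csv", "rb") as f:
    requests.put(
        f"{bucket}/detections.csv",
        data=f,
        params={"access_token": TOKEN},
    ).raise_for_status()
```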

You ask how many people actually do this. You can find answers at the repositories themselves, but I'd suggest that what matters most is whether you want to do it for your own reasons, and whether your funders, or your organization's funders, require it.