
Data upload solution for a citizen science project?

Hi everyone. 

I’m looking to build a web-based data upload server for a citizen science project and am wondering if there are out-of-the-box solutions available, or useful Python packages or libraries that would make the job easier?

I don’t really want to reinvent the wheel; it seems that something like this may already exist and I’m just looking in the wrong place.

The brief is that our volunteers make audio recordings to monitor threatened species, then upload their data for archiving and automated processing. I’d like a server that has the following: 

  1. Simple web-based user interface - many of our participants have limited confidence with computers; 
  2. No client-side software to install; 
  3. User management: registration to approved email addresses only (or similar, maybe a manual admin approval process); 
  4. Data files are 1 to 40 MB in size, but there are lots of them: ~1,000 files and ~10 GB in total. If a user loses their network connection, uploads should be recoverable, with the server able to resume an upload where it left off. That's quite important. 
  5. Live progress and status updates to the user.

I have access to a web hosting server. Maybe a Django or Flask implementation already exists, or there's something similar I could adapt.

There is Indicia, but maybe this is more work than you were hoping for?

Hi Jim,

What are your budget constraints?

How many concurrent users?

cheers

Frank

Based on your description, it sounds like a shared Dropbox (or similar) would do? I'm not sure how well those handle resuming after dropped connections, though. For that specifically, MASV does a great job. I realize these are not self-hosted solutions, but if you have any budget at all, they're low cost, and it's probably best to avoid trying to write this yourself.

If you want to go down the Django route and want a solid system for managing things like this, the Wagtail CMS could be a good fit: it'll do a lot of this for you (user accounts, an admin interface) and leave you room to grow.
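For what it's worth, the vanilla-Django plumbing underneath is pretty small. A rough sketch, assuming a made-up Recording model and upload_recording view (those names are mine, not anything Wagtail provides):

    # models.py -- one row per uploaded audio file (names are illustrative)
    from django.conf import settings
    from django.db import models

    class Recording(models.Model):
        uploader = models.ForeignKey(settings.AUTH_USER_MODEL,
                                     on_delete=models.CASCADE)
        audio = models.FileField(upload_to='recordings/%Y/%m/')
        uploaded_at = models.DateTimeField(auto_now_add=True)

    # views.py -- accept one POSTed file from a logged-in volunteer
    from django.contrib.auth.decorators import login_required
    from django.http import JsonResponse
    from django.views.decorators.http import require_POST

    @login_required
    @require_POST
    def upload_recording(request):
        f = request.FILES.get('audio')
        if f is None:
            return JsonResponse({'error': 'no "audio" file in request'},
                                status=400)
        rec = Recording.objects.create(uploader=request.user, audio=f)
        return JsonResponse({'id': rec.id, 'status': 'stored'})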

But this might fail on the part that matters most here: resuming a large multi-part upload after a dropped connection, since a plain POST like that starts over from scratch.

So it might be best to write directly to an Amazon S3 bucket and use their available libraries to manage the uploads (you'll probably want cheap object storage anyway if you're handling that many files). There's a guide linked below which covers everything in almost too much detail, but I'm sure people have jumped off it to build things. I've used the Boto3 SDK to write to S3 a lot, including plenty of huge files, but I've only ever handled dropped connections by restarting the process when it failed, so I can't be 100% confident on resumption.
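That said, S3's low-level multipart API can resume: each part is uploaded separately, and you can ask S3 which parts it already holds and send only the rest. A rough sketch of that idea (bucket, key, and part size are placeholders; error handling and retries omitted):

    import os
    import boto3

    s3 = boto3.client('s3')
    PART_SIZE = 8 * 1024 * 1024  # parts must be >= 5 MB, except the last

    def resumable_upload(path, bucket, key, upload_id=None):
        # Pass the same upload_id back in after a crash to resume.
        # (Resuming assumes PART_SIZE is unchanged between runs.)
        if upload_id is None:
            upload_id = s3.create_multipart_upload(
                Bucket=bucket, Key=key)['UploadId']

        # Ask S3 which parts it already holds, so only the rest are sent.
        # One page of list_parts covers 1000 parts -- plenty for 40 MB files.
        have = {p['PartNumber']: p['ETag']
                for p in s3.list_parts(Bucket=bucket, Key=key,
                                       UploadId=upload_id).get('Parts', [])}

        parts, total = [], os.path.getsize(path)
        with open(path, 'rb') as f:
            number = 1
            while True:
                chunk = f.read(PART_SIZE)
                if not chunk:
                    break
                if number in have:
                    etag = have[number]  # already on S3: skip re-sending
                else:
                    etag = s3.upload_part(Bucket=bucket, Key=key,
                                          UploadId=upload_id,
                                          PartNumber=number,
                                          Body=chunk)['ETag']
                parts.append({'PartNumber': number, 'ETag': etag})
                print(f'{min(number * PART_SIZE, total)} / {total} bytes done')
                number += 1

        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={'Parts': parts})
        return upload_id

You'd stash the returned upload_id somewhere (even a local file) so an interrupted run can pick up where it left off. As far as I know, the higher-level boto3 upload_file call does multipart for you but doesn't expose an id to resume from.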