How do you report potential duplicates?

Submitted by Mesh on

I've been looking at records of Nomad bees lately and have come across a few instances where I may have discovered some duplicates. A recent search brought up two records, Record ID 20488779 and Record ID 20486781, which on the face of it seem to be the same sighting. I found them by searching all records for Nomada flavoguttata and then manually zooming the map into the Yorkshire/Lancashire area.

Will these duplicates eventually be picked up by the system? Is there some sort of automated data cleanser? I could add a comment to 20486781, but the record is from 4 years ago, so I don't know whether that user is still active or, if they are, whether they would react to my comment. Or is this a fault in the "explore all records" page? I think that if you delete a record it isn't actually expunged from the database, just marked as deleted somehow. I seem to remember that deleted records don't show up in a normal search (for obvious reasons), but I could still bring up a previously deleted record if I knew the record ID and searched for it specifically. Both these records appear to have been updated by the user, so maybe he deleted the first one for some reason. Is there somewhere I should be flagging up potential duplicates when I come across them?

Comments

Submitted by Jim Alder on Thu, 11/12/2025 - 06:51

Duplicates often occur when recorders add a record to iRecord and then again to iNat, the latter of which is then migrated to iRecord. I tend to spot these for the species I verify and mark the iNat one as not accepted, with a comment stating why, i.e. duplication. Sometimes folk add a record to iRecord on their phone app and then use a taxon-specific recording form, not realising it is also an iRecord form.

The NBN Atlas (which iRecord eventually feeds into) is absolutely riddled with duplicates, often with the same record occurring 3 or 4 times. I often look at downloaded NBN data for specific species for my county and think a species has been recorded many times initially, but then spot multiple records with the same date/grid ref etc. Based on my experience, I wouldn't be surprised if around 30% of the NBN records are duplicates.

Maybe one day they'll get some AI thing to strip out all the duplicates and clean the dataset up...

Submitted by OliP_CEH on Thu, 11/12/2025 - 10:57

It's perhaps worth pointing out that most applied uses of biological records will deal with duplication in some way. Mapping using a given spatial and temporal scale (e.g. hectad maps using broad date-classes) intrinsically hides duplicates, and most statistical analyses of biological records will reduce or aggregate data to some common spatio-temporal scale before analysis. It is only what one might call "naive" analyses that are troubled by duplicates (e.g. counting actual records and assuming they index something useful). As data manager of the British Bryological Society, I therefore tend to spend very little time worrying about duplicates. Of course, I accept that they can sometimes be annoying (e.g. bringing rejected records back into a live dataset somehow, or meaning that several records have to be corrected when record duplicates share mistakes), but in my experience it is almost never worth putting the (often very large) effort into general removal of duplicates relative to the (small) effort required to deal with them sensibly when one is doing something applied with the dataset. Just my opinion and experience anyway.
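To illustrate the point about aggregation, here is a minimal Python sketch. It is not anything iRecord or any recording scheme actually runs. It assumes each record is a plain (species, grid reference, year) tuple with a two-letter British grid reference, the grid references are made up, and the `to_hectad` and `occupied_hectads` names are hypothetical. Reducing records to unique species/hectad/year combinations gives the same answer whether or not a duplicate is present.

```python
def to_hectad(gridref):
    """Reduce a two-letter British grid reference (e.g. 'SD 123 456') to its 10 km square ('SD14').

    Assumption: only plain references with an even number of digits are handled
    (no tetrad letters, no single-letter Irish references).
    """
    ref = gridref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    if not letters.isalpha() or not digits.isdigit() or len(digits) < 2 or len(digits) % 2:
        raise ValueError(f"unsupported grid reference: {gridref!r}")
    half = len(digits) // 2
    # First easting digit and first northing digit identify the hectad.
    return letters + digits[0] + digits[half]


def occupied_hectads(records):
    """Collapse records to the set of unique (species, hectad, year) combinations."""
    return {(species, to_hectad(gridref), year) for species, gridref, year in records}


# Made-up records for illustration only.
records = [
    ("Nomada flavoguttata", "SD123456", 2021),
    ("Nomada flavoguttata", "SD123456", 2021),  # the same sighting entered twice
    ("Nomada flavoguttata", "SE301402", 2021),
]
without_duplicate = records[:1] + records[2:]

print(len(records), "records ->", len(occupied_hectads(records)), "occupied hectads")
```

Because the result is a set, the duplicated sighting has no effect on the hectad map, whereas a simple count of records would change.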

Submitted by Barry Walter on Thu, 11/12/2025 - 15:29

Duplicates seem like a minor irritation with very little statistical significance, compared with some of the other records in the database. To see what I mean, go to the Explore All page, zoom out, and click on a few of those circles in the middle of the ocean, or e.g. in Africa. It doesn't take long to find dozens of records with major data accuracy issues. Picking one at random, what are we to make of 10196697, which is an accepted record located in the middle of the Gulf of Guinea, with species Bombus, and the quantity shown as 500? The site and habitat suggest a park or garden in Weston-super-Mare, so I suppose the record isn't completely useless. However, examples like these (of which there are likely to be thousands) show just how little correspondence there may be between what is actually recorded in the database and what is eventually made of it by verifiers and downstream consumers.

What does it even mean for someone to say they saw five hundred bumblebees, even if all the other data is accurate? How on earth could anyone verify that? Personally, I never record the quantity as anything other than what is shown in the photos (which I always include). I'm sure that some people are capable of making reliable estimates, but it seems reasonable to assume that the vast majority of modern citizen science data never even includes abundance estimates (let alone accurate ones). I don't know whether there have been any recent studies regarding this, but it wouldn't surprise me if, in general, abundance data tends to drastically under-estimate what was actually present, because so many people just record what they were able to get a photograph of, rather than everything they actually saw. If this hunch is broadly correct, it would probably take quite a large number of duplicates to make any real difference (even if some of them included abundance estimates).

The vast majority of records accepted by iRecord don't have photographs, so how should duplicates be recognised? Since the time isn't recorded, it's possible for multiple records from the same day to have identical data. Or the same specimens could be recorded by different people using otherwise identical data. And so on, and on. Once these kinds of possibilities are amplified by all the different sources iRecord takes data from, it starts to become clear why attempting to eliminate duplicates is barely worth even considering. It's just a necessary minor evil of processing the relatively large amounts of data that have become available in recent times.
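To make the difficulty concrete, here is a rough Python sketch of about the best an automated check could do with these fields. The record layout and the `candidate_duplicates` name are hypothetical, not iRecord's schema, and the records are made up. It groups records that share species, date and grid reference, and it can only ever report candidates: nothing in the data separates a double entry from two genuine sightings on the same day.

```python
from collections import defaultdict


def candidate_duplicates(records):
    """Group record IDs that share species, date and grid reference (hypothetical fields)."""
    groups = defaultdict(list)
    for rec in records:
        key = (
            rec["species"].strip().lower(),
            rec["date"],
            rec["gridref"].replace(" ", "").upper(),
        )
        groups[key].append(rec["id"])
    # Only groups with more than one record are worth a human look.
    return [ids for ids in groups.values() if len(ids) > 1]


# Made-up records for illustration only.
records = [
    {"id": 1, "species": "Bombus terrestris", "date": "2025-06-01", "gridref": "ST 3161", "recorder": "A"},
    {"id": 2, "species": "Bombus terrestris", "date": "2025-06-01", "gridref": "ST3161", "recorder": "A"},
    {"id": 3, "species": "Bombus terrestris", "date": "2025-06-01", "gridref": "ST3161", "recorder": "B"},
    {"id": 4, "species": "Bombus terrestris", "date": "2025-06-02", "gridref": "ST3161", "recorder": "A"},
]

print(candidate_duplicates(records))
```

Records 1, 2 and 3 all land in one group, yet only a person could say whether record 3 is the same bees seen by someone else or an independent sighting, and adding the recorder to the key would simply hide that case rather than resolve it.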

I suppose if duplicates really bother you, there's no reason not to add a comment - assuming the records are truly identical (including the source dataset). Presumably, most recorders will want to know if they've mistakenly uploaded the same record twice, even if it's too late to do much about it (i.e. if it's already been reviewed).