More Duplicate Records !

Submitted by Duncan Cooke on

I have just entered a record for Scotch Argus, Erebia aethiops, and then looked at 'Accepted as correct records only', where there are THREE identical records all showing the same photograph. It makes a mockery of the whole recording scheme.

Submitted by Barry Walter on Tue, 10/01/2023 - 20:13

The records all show "Abundance 3", so it seems the user may have entered the same photo (using the iRecord App) for each butterfly they saw. Statistically speaking, this sort of thing will make very little difference in the long run, so even though it's not exactly ideal, it's not really making a "mockery" of anything. After all, there are many accepted records on iRecord with much broader estimates of abundance that do not provide any verifiable evidence (or perhaps only a few representative photos). Likewise, there are also many accepted records of single specimens which do not register the true abundance at all (i.e. they're simply recording presence at a given date/location, rather than an explicit count).

Submitted by Duncan Cooke on Wed, 11/01/2023 - 01:11

Sorry Barry, but I disagree. Surely the point is to record a species sighting, and when you do, you also fill in the 'Quantity' (3). There is also a comments box at the bottom where you can go into more detail if required. In my case I recorded 20+. Are you saying I should have made 20 separate records, using the same photo for each? Surely not. The point of uploading a photo is so an 'expert' can verify that your record is correct... or not. A multitude of records cannot be verified by a single photo.

In a previous post on the subject of duplicate records, R. Homan, (VC33 Moths), said this:

''Not had to deal with multiple use of the same photograph, but should it arise I would probably reject the records as it isn’t clear what instance is supported by the picture''.

To take it to the extreme, if you found a wood ant's nest containing up to 250,000 individual ants, would you enter a single record, or 250,000 separate ones ?

Submitted by Barry Walter on Wed, 11/01/2023 - 18:35

I'm not sure what you disagree with, since all the points you made are in agreement with mine. Statistically speaking, it clearly makes no difference whether you post one photo with quantity=3, or three photos with quantity=1. If your record has a photo of a single butterfly with quantity=20, you are effectively using the same photo as evidence for all twenty. The verifier has to accept on trust that all the butterflies you saw were of the same species, since the evidence is necessarily incomplete.

Regarding the other post: I would say we should always assume that people are acting with good intentions and try to make the best use of the data we can. And more generally, we should always keep in mind that the vast majority of records are supplied by volunteers who may have spent many hours of their free time collecting, preparing and uploading their data. By contrast, it usually only takes a few minutes to review a record, so I would hope that verifiers don't reject any too hastily. A little common sense can go a long way when handling crowd-sourced data.

For my own part, I only ever upload records where the quantity matches what is shown in the photos, and never use broad estimates. For a Wood Ant nest, I would usually take the photo from close enough to make a rough count of all the individuals shown. The quantity isn't particularly meaningful, of course, since mound-building ants are quintessentially social organisms, so it makes more sense to count the nests.

Submitted by Duncan Cooke on Wed, 11/01/2023 - 22:42

I think we are broadly in agreement, Barry, but when you say "Statistically speaking, it clearly makes no difference whether you post one photo with quantity=3, or three photos with quantity=1", it clearly does make a difference statistically. When you look at 'Accepted as correct records only', my record is counted as a single record (even though I estimated the quantity as 20+ when filling it in), whereas the other person is showing three separate records and presenting the verifier with the same photo as evidence for each one, which is surely wrong. And because all three are accepted as correct and the quantity in each is given as 3, does that make a total of 9? It is misleading and confusing, and it makes you wonder how valid the record is.

When I make a record I generally put the quantity as 1 and submit photos from different angles to aid verification. If you leave the quantity blank it won't accept your record, and you have to go back and enter a number. It is my assumption that the 'quantity' section is there so that when people look at your record it gives them an idea of the population in the area where you submitted it (hence 20+). I suppose I am effectively using the same photo as evidence for twenty or more, but the difference is that I only have 1 accepted-as-correct record, not 20 (or 3).

So the point I'm making is that photos should only be used as evidence to support one individual record, not a multitude of separate ones.

Submitted by Barry Walter on Thu, 12/01/2023 - 18:55

I think you are conflating different uses of the records. The principal purpose of iRecord is to act as a data warehouse. It receives raw data from many different sources, cleans it up in various ways, and re-distributes it to recording schemes and other interested parties. A secondary, and entirely separate, use is to create maps for display within the iRecord interface. These maps merely show the recording activity of individuals who use the phone app and/or the web portal to submit records. It therefore makes little sense to take those maps at face value and assume they are fully representative. Thus, when I say "statistically speaking", I'm only talking about how the processed data may be used downstream by third parties - not the subset of it presented as maps in the iRecord interface.

Duplication is an inevitable consequence of accepting data from multiple sources. It's therefore hopelessly naive to assume there's always going to be a one-to-one mapping between reported sightings and all the individuals actually present at a given date/location. After all, several independent recorders may visit the same place on the same day, with each producing records of the same species, along with varying estimates of abundance. You cannot simply add up those estimates and expect to get an accurate total of the population that is actually present, since there's a high likelihood of significant overlap. Then again, if you ignore the estimates and only count what's shown in the photos, it's still not safe to assume that the same individual wasn't photographed multiple times by one or more of the recorders. Of course, most downstream data consumers are perfectly well aware of this (as well as many other related issues), and will use much more sophisticated statistical models to get an accurate picture of e.g. population size within a given area. Needless to say, if the model isn't robust enough to cope adequately with the inherent messiness of the input, there's no point blaming the data (or the recorders).
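To put rough numbers on the overlap point above, here is a toy sketch in Python. The recorders and the individuals (labelled a to f) are entirely made up, and in practice nobody knows which individual is which, so this only illustrates why adding up counts overstates the total.

```python
# Made-up example: three recorders visit the same site on the same day.
# Each set holds the (hypothetical) individual butterflies that recorder saw.
seen_by = {
    "recorder_1": {"a", "b", "c", "d"},
    "recorder_2": {"c", "d", "e"},
    "recorder_3": {"a", "e", "f"},
}

# Adding up each recorder's count double-counts the shared individuals.
naive_total = sum(len(individuals) for individuals in seen_by.values())

# The number actually present is the size of the union of the three sets.
distinct_total = len(set().union(*seen_by.values()))

print(naive_total, distinct_total)  # 10 6
```

Since real records carry no such labels, the true overlap has to be estimated, which is where the statistical models come in.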

As for your specific examples: I'm afraid I remain far from convinced that such minor inconsistencies could ever "make a mockery of the whole recording scheme", as you put it.

Submitted by Duncan Cooke on Thu, 12/01/2023 - 22:15

So, in your opinion, Barry, should the three records in question be 'cleaned up' by having two of them deleted? Or left as three individually verified and accepted-as-correct records?

Submitted by Barry Walter on Fri, 13/01/2023 - 20:45

If we're referring to the same records, from what I can see, the verifier reviewed them all within a minute or so of each other, then accepted them as correct and left comments thanking the recorder. So it seems they were satisfied with the records and saw no reason to make any changes.

In general, I just trust that the system is robust enough to handle a certain amount of duplication - so even though I don't like to see it, I don't really worry about it much. If iRecord had a lot more funding available, I'm sure that more could be done to minimise the duplication, and perhaps also clean up the databases. However, my hunch is that if they ever got that extra funding, this issue wouldn't be at the top of their todo list...

Submitted by Duncan Cooke on Fri, 13/01/2023 - 21:28

With all due respect, you haven't answered my question.  In your opinion, Barry, should the three records in question be 'cleaned up' by having two of them deleted ?

Submitted by Barry Walter on Sat, 14/01/2023 - 15:02

The opinion I gave is that, in general, I don't worry about it :-)

The issue of duplication was discussed at great length on the old iRecord forum when the import of records from iNaturalist was first started. The admins and several verifiers contributed to the thread, and partly as a result of that the importing was temporarily suspended. In the interim, a few improvements were made, but when the imports were restarted, it quickly became clear that nothing had really changed regarding the problem with duplicates (as can be seen from this recent forum post). The conclusion I draw from all this is that the admins just don't consider the current level of duplication to be enough of a problem to be worth fixing. And looking forward, it seems clear from this NBN data flow diagram that the situation is likely to become even more entrenched if/when iSpot starts contributing records.

Submitted by Barry Walter on Sat, 14/01/2023 - 18:18

PS:

I just saw this other thread about repeated records. If the example that began this thread was caused by a similar system bug, then of course that should be fixed ASAP, and the superfluous records should be cleaned up. However, this is a separate issue to the one of duplication in general (i.e. resulting from normal usage), which is what I was mainly discussing above (since this is the General Discussion section of the forum).

PPS:

After investigating a little further, I think this really should have been posted as a bug report to the Feedback section of the forum. When I look at all the records entered by the user for that day (26/10/2002), I can now see that all of their records have been triplicated. If this was caused by a bug, are the admins aware of it? Has it already been fixed? And if so, was any consideration given to removing the extraneous records at the time?

Submitted by Duncan Cooke on Sat, 14/01/2023 - 19:25

You still haven't answered my question. Just a simple yes or a no would do.

If you consider the most valuable records to be those that have been verified and accepted as correct records, which I assume you would, (though maybe not ? ), perhaps you would like to consider this; 

I recently entered a record of Crataerina hirundinis, and when I looked at 'Accepted as correct records only' it was showing 25 records, but 2 were duplicates. That's 8%.

I sent a notification to both and one has since been removed, but the other remains. 

I don't have the time, the inclination or the know-how to work out whether this is a true indication of the level of duplicate records. It may be lower, it may be higher, but I do know it is quite easy to find duplicate records without trying very hard.

Perhaps admins need to do a proper assessment of the number of duplicates to decide if it is enough of a problem to be worth fixing, because if it is anywhere around 8% I think it probably is significant enough to do something about, wouldn't you think ?

The recent forum post you refer to here, https://irecord.org.uk/forum/general-discussion/irecord-inaturalist, indicates to me that there is clearly still a problem that needs to be addressed. That post has been up for a week without a single response, even though the person asking the question is clearly looking for some guidance. So it would appear nobody really cares that much, or, as you put it, the admins just don't consider the current level of duplication to be enough of a problem to be worth fixing, which I think is a shame, if true.

Submitted by Barry Walter on Sat, 14/01/2023 - 22:30

It was neither suggested nor implied that "nobody cares". It's simply a matter of deciding how best to make use of very limited funding. I don't speak for the admins, but it seems to me that they decided that it's not worth stopping the imports from iNaturalist merely because they don't have the funds to fix the duplication. Either that, or, due to the way the current system is implemented, it's just not feasible. (Probably it's a bit of both). And just to be clear: none of this has anything to do with the bug I mentioned in my previous post.

As for guidance regarding the relationship between iNaturalist and iRecord: please see this help page.

Submitted by Duncan Cooke on Sun, 15/01/2023 - 01:43

No, you neither suggested nor implied that "nobody cares", and I didn't say that you did.

But you did say;  ''The conclusion I draw from all this is that the admins just don't consider the current level of duplication to be enough of a problem to be worth fixing''. 

So there seems little point in discussing this any further.

Submitted by Gustav Clark on Sun, 15/01/2023 - 12:12

What is annoying is that de-duplication is very simple. There is a set of fields which are supplied in the original submission. If all of these are identical then the record is a duplicate. In considering the magnitude of the task, indexes exist for date, taxon and OSGR. You only need to look at the other fields once these 3 have been filtered.

It requires some extra code, but it is very simple. Thereafter it can be run either as a batch job for the whole database or on each record as it is submitted. It should have been done as part of the iNaturalist import project.
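As a rough illustration of the approach described above, here is a minimal sketch in Python. It is not iRecord's code: the field names (`id`, `date`, `taxon`, `osgr`, `quantity`, `photo`, `recorder`) and the sample records are assumptions made up for the example, since the real schema isn't shown in this thread. Records are bucketed on the three indexed fields first, and the remaining fields are only compared within each bucket.

```python
from collections import defaultdict

# Hypothetical field names; the real iRecord schema is not shown in this thread.
KEY_FIELDS = ("date", "taxon", "osgr")


def find_duplicates(records):
    """Return lists of record ids whose submitted fields are all identical."""
    # Step 1: bucket on the three indexed fields (date, taxon, OSGR).
    buckets = defaultdict(list)
    for rec in records:
        buckets[tuple(rec[f] for f in KEY_FIELDS)].append(rec)

    # Step 2: compare every other field, but only within each bucket.
    groups = []
    for bucket in buckets.values():
        if len(bucket) < 2:
            continue
        exact = defaultdict(list)
        for rec in bucket:
            # Skip the database id, which differs even between true duplicates.
            fingerprint = tuple(sorted((k, v) for k, v in rec.items() if k != "id"))
            exact[fingerprint].append(rec["id"])
        groups.extend(ids for ids in exact.values() if len(ids) > 1)
    return groups


# Made-up sample data: records 1-3 are identical, record 4 differs.
same = {"date": "2022-08-01", "taxon": "Erebia aethiops", "osgr": "NN1234",
        "quantity": 3, "photo": "a.jpg", "recorder": "A"}
records = [
    {"id": 1, **same},
    {"id": 2, **same},
    {"id": 3, **same},
    {"id": 4, **{**same, "quantity": 1, "photo": "b.jpg", "recorder": "B"}},
]

print(find_duplicates(records))  # [[1, 2, 3]]
```

The same function could in principle be run over the whole table as a batch job, or against a single bucket when a new record arrives. Whether flagged groups are then deleted or merely marked for a verifier is a separate decision.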

Submitted by Barry Walter on Wed, 18/01/2023 - 19:03

Those three fields clearly aren't sufficient. Let's say a recorder stands in one spot, takes photos of six different individuals of the same species, and then submits them all as separate records. This could easily produce six records with exactly the same taxon, date and coordinates (rounded to a given precision). In fact, it's very likely that the only significant difference between them would be the photos themselves - which are much more difficult to compare. Thus, in general, there's no completely reliable way to identify genuine duplicates using the textual information alone.
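The six-photo scenario above can be sketched in a few lines of Python. The records and field names are invented for illustration, and it assumes each photo ends up with a distinct file name, which may not hold in a real system.

```python
from collections import Counter

# Six made-up records: one recorder, one spot, six different individuals.
# Only the photo differs between them.
sightings = [
    {"taxon": "Erebia aethiops", "date": "2022-08-01", "osgr": "NN1234",
     "photo": f"img_{i}.jpg"}
    for i in range(6)
]


def key3(rec):
    """The three indexed fields only: taxon, date and grid reference."""
    return (rec["taxon"], rec["date"], rec["osgr"])


by_key3 = Counter(key3(r) for r in sightings)
by_all = Counter(tuple(sorted(r.items())) for r in sightings)

# Number of records that each rule would flag as duplicates.
flagged_by_three_fields = sum(n for n in by_key3.values() if n > 1)
flagged_by_all_fields = sum(n for n in by_all.values() if n > 1)

print(flagged_by_three_fields, flagged_by_all_fields)  # 6 0
```

Matching on the three fields alone flags all six genuine records, while the full comparison only clears them because the photo names happen to differ. If the same image were stored under two names, or re-encoded, a text comparison would not catch it.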

Now, if those same six records were then uploaded to iNaturalist and later imported to iRecord, even the photos would be the same. The only salient distinction would be the user/display names. But there's no guarantee that every recorder will use the same names for both their accounts, and even if they did, there's still no guarantee that the names will be unique across both platforms. Thus, without a mechanism to explicitly link the two accounts, there's no way to know for certain whether an identical record submitted to both iRecord and iNaturalist originated from the same recorder.

But in any case: what exactly does it mean to say that two records are "the same"?

On iNaturalist it's perfectly legitimate for two people to submit exactly the same data using separate accounts, because the basic unit of interaction is the personal observation rather than the specimen record. Thus, a couple out walking together could take a single photo of a bird, and then each upload it to their own account as a unique observation. The data may be identical, but iNaturalist does not treat this as duplication, because each observer is recording their own encounter with the bird. And what is more, the two observers may neither know nor care whether their observations are later treated as specimen records by some unknown third-party organisation (like iRecord). This all goes to show that the perception of duplication is context dependent, so there's really no simple answer to what counts as being "the same".