The following comparisons are used (in this order, chosen for performance reasons):
This 1 year margin was chosen because "ahead of print" publications and final publications often are not published in the same year. A margin larger than 1 year would make the program a lot slower (see Justification: Order of comparisons).
If the starting and ending page of at least one of the publications are more than 2 pages apart, then the DOIs are compared first. If the DOIs are different or one or both are absent, the starting pages are compared.
Otherwise the starting pages are compared first. If the starting pages are different or one or both are absent, the DOIs are compared.
Meeting abstracts often get the DOI of the whole conference proceedings. Comparing them by DOI produces a lot of False Positives. Using this "more than 2 pages" choice circumvents this problem.
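The ordering rule above can be sketched in Java as follows. This is only an illustration under stated assumptions, not DedupEndNote's actual code: the class `PageOrDoiOrder`, the stand-in type `Pub` and its field names are hypothetical, and "more than 2 pages apart" is read as `endPage - startPage > 2`.

```java
final class PageOrDoiOrder {
    /** Which comparison produced the match (hypothetical result type). */
    enum MatchedOn { DOI, STARTING_PAGE, NONE }

    /** Hypothetical stand-in for a record; null means the field is absent. */
    static final class Pub {
        final String doi;
        final Integer startPage;
        final Integer endPage;

        Pub(String doi, Integer startPage, Integer endPage) {
            this.doi = doi;
            this.startPage = startPage;
            this.endPage = endPage;
        }

        /** Assumption: "more than 2 pages apart" means endPage - startPage > 2. */
        boolean spansMoreThanTwoPages() {
            return startPage != null && endPage != null && endPage - startPage > 2;
        }
    }

    static boolean sameDoi(Pub a, Pub b) {
        return a.doi != null && a.doi.equalsIgnoreCase(b.doi);
    }

    static boolean sameStartPage(Pub a, Pub b) {
        return a.startPage != null && a.startPage.equals(b.startPage);
    }

    /** DOI first for longer publications, starting page first otherwise. */
    static MatchedOn compare(Pub a, Pub b) {
        boolean doiFirst = a.spansMoreThanTwoPages() || b.spansMoreThanTwoPages();
        if (doiFirst) {
            if (sameDoi(a, b)) return MatchedOn.DOI;
            return sameStartPage(a, b) ? MatchedOn.STARTING_PAGE : MatchedOn.NONE;
        }
        if (sameStartPage(a, b)) return MatchedOn.STARTING_PAGE;
        return sameDoi(a, b) ? MatchedOn.DOI : MatchedOn.NONE;
    }
}
```

For two one-page meeting abstracts that share the DOI of the proceedings, this sketch decides on the starting page before it ever looks at the DOI.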
Jaro-Winkler similarity: See this Wikipedia page for a definition and here for some examples.
The fields Original publication (OP), Short Title (ST), Title (TI) and sometimes Book section (T3, see below) are treated as titles.
Because the Jaro-Winkler similarity algorithm puts a heavy penalty on differences at the beginning of a string, the normalized titles are also reversed if the publication is longer than 1 page (i.e. except for meeting abstracts, replies, retraction notices, ...).
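The effect of reversing a title can be illustrated with a small, self-written Jaro-Winkler implementation. This is a sketch under assumptions: the class and method names are hypothetical, the implementation uses the textbook definition (prefix scale 0.1, prefix length at most 4), and taking the higher of the forward and reversed scores is an assumption about how the two results might be combined, not a description of DedupEndNote's code.

```java
final class ReversedTitleSimilarity {

    /** Textbook Jaro similarity, in the range [0, 1]. */
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length();
        int len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / len1 + m / len2 + (m - outOfOrder / 2.0) / m) / 3.0;
    }

    /** Jaro-Winkler: Jaro plus a bonus for a common prefix of up to 4 characters. */
    static double jaroWinkler(String s1, String s2) {
        double jaro = jaro(s1, s2);
        int maxPrefix = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < maxPrefix && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    /** Assumption: when reversal applies, the higher of both scores is used. */
    static double best(String a, String b, boolean alsoReversed) {
        double forward = jaroWinkler(a, b);
        if (!alsoReversed) return forward;
        return Math.max(forward, jaroWinkler(reverse(a), reverse(b)));
    }
}
```

With the toy strings "xabcdef" and "abcdef" the forward comparison gets no prefix bonus because the first characters differ, while the reversed strings share a 4-character prefix and therefore score higher.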
This rule is skipped if both records have the same DOI (that comparison was made in step 2).
The fields Journal / Book Title (T2), Alternate Journal (J2) and sometimes Book section (T3, see below) are treated as journals. All ISBNs, ISSNs and journal titles (including abbreviations) in the records are used.
If both records have an ISBN, the ISBNs are compared (stop); if both have an ISSN, the ISSNs are compared (stop); otherwise the journal titles are compared.
Abbreviated and full journal titles are compared in a sensible way (see Examples of comparisons).
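The ISBN / ISSN / journal title cascade could look roughly like the sketch below. All names (`JournalCascade`, `Identifiers`) are hypothetical, and the journal title comparison is simplified to "the two records share at least one identical normalized title". DedupEndNote itself compares journal titles by similarity, which is not modelled here.

```java
import java.util.Collections;
import java.util.Set;

final class JournalCascade {

    /** Hypothetical container for all identifiers and titles found in one record. */
    static final class Identifiers {
        final Set<String> isbns;
        final Set<String> issns;
        final Set<String> journalTitles;

        Identifiers(Set<String> isbns, Set<String> issns, Set<String> journalTitles) {
            this.isbns = isbns;
            this.issns = issns;
            this.journalTitles = journalTitles;
        }
    }

    static boolean shareElement(Set<String> a, Set<String> b) {
        return !Collections.disjoint(a, b);
    }

    /** ISBNs decide if both records have one, then ISSNs, else the journal titles. */
    static boolean sameJournal(Identifiers a, Identifiers b) {
        if (!a.isbns.isEmpty() && !b.isbns.isEmpty()) {
            return shareElement(a.isbns, b.isbns); // stop: titles are not consulted
        }
        if (!a.issns.isEmpty() && !b.issns.isEmpty()) {
            return shareElement(a.issns, b.issns); // stop: titles are not consulted
        }
        // Simplification: exact match instead of a similarity score.
        return shareElement(a.journalTitles, b.journalTitles);
    }
}
```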
In the following table the results of EndNote's Find duplicates are compared to the comparisons in DedupEndNote. For these tests only one field was selected in EndNote in "Edit > Preferences > Duplicates", and "Ignore spacing and punctuation" was selected.
| Field | Examples | EndNote finds duplicates | DedupEndNote Score |
|---|---|---|---|
| Starting page and article number | | ??? | ??? |
| Title | | Yes | 1.00 == Yes |
| Title | | No | 0.92 == Yes |
| Title | | No | 0.92 == Yes |
| Title | | No | 0.99 == Yes |
| Title | | No | 0.96 == Yes |
| Title | | No | 0.97 == Yes * |
| Title | | No | 1.00 == Yes * |
| Title | | No | 1.00 == Yes * |
| Title | | No | 1.00 == Yes * |
| Title | | No | 0.96 == Yes * |
| Title | | No | 0.91 == Yes * |
| Authors | | No | 0.75 == Yes |
| Authors | | No | 0.93 == Yes |
| Authors | | No | 0.90 == Yes |
| Authors | | No | Yes |
| Journal | | No | Similar == Yes |
| Journal | | No | Similar == Yes |
| Journal + ISSN | | No | Similar == Yes |
| Journal | | No | Similar == Yes |
| Journal | | No | Similar == Yes |
| Journal | | No | Similar == Yes * |
| Journal | | No | Similar == Yes |
| Journal | | No | NOT similar == No |
*: In these cases the comparison of DedupEndNote for this content for this field is not accurate. However, the comparison of the other fields for these records does not result in YES answers, so the records are ultimately not considered duplicates.
Only the first record of a set of duplicate records is copied to the output file.
When writing the output file, the following fields will be changed:
| Field | Change |
|---|---|
| Author (AU) | |
| DOI (DO) | |
| Journal name (T2) | |
| Publication year (PY) | |
| Starting page (SP) and Article Number (C7) | |
| Title (TI) | |
The general rule is:
| Comparison | Result | Action |
|---|---|---|
| 1 ... 5 | YES | go to next comparison if present, else mark the records as duplicates |
| 1 ... 5 | (insufficient data for the comparison in one of the records) | go to next comparison if present, else mark the records as duplicates |
| 1 ... 5 | NO | stop comparisons for this pair of records |
Justification:
If the world were perfect, then the absence of Starting Page / DOI / Authors / ... in an EndNote record would mean that the corresponding publication has no Starting Page / DOI / Authors / .... Alas, the world isn't perfect yet.
DedupEndNote therefore interprets a comparison with insufficient data from one of two records not as "NO" (i.e. these records are different), but as "UNKNOWN" (i.e. we can't tell (yet)), and continues with the other comparisons for these records.
As a consequence, fields which are useful in comparisons but are not always available (e.g. DOI), can be used.
NOTE: If both records have insufficient data for a comparison, DedupEndNote interprets the fields as different. If there is no alternative for the comparison (see below), then DedupEndNote stops the comparisons for this pair of records and considers them different publications. See 2.5. Effect of insufficient data from one of two records.
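The general rule, including the YES / UNKNOWN / NO interpretation, can be sketched as a chain of three-valued comparisons. This is a simplified illustration, not the tool's implementation: the names are hypothetical, every field is compared by case-insensitive equality instead of the real per-field logic, and the "alternative comparison" mentioned in the NOTE is left out.

```java
import java.util.List;

final class ComparisonChain {
    enum Result { YES, UNKNOWN, NO }

    static boolean present(String value) {
        return value != null && !value.trim().isEmpty();
    }

    /**
     * One record without data gives UNKNOWN; both without data gives NO.
     * Simplification: real comparisons use similarity, not plain equality.
     */
    static Result compareField(String a, String b) {
        boolean hasA = present(a);
        boolean hasB = present(b);
        if (!hasA && !hasB) return Result.NO;
        if (!hasA || !hasB) return Result.UNKNOWN;
        return a.equalsIgnoreCase(b) ? Result.YES : Result.NO;
    }

    /**
     * Walks the comparisons in order: the first NO stops the chain,
     * YES and UNKNOWN both continue with the next comparison.
     * Assumes both lists hold the same fields in the same order.
     */
    static boolean isDuplicate(List<String> fieldsA, List<String> fieldsB) {
        for (int i = 0; i < fieldsA.size(); i++) {
            if (compareField(fieldsA.get(i), fieldsB.get(i)) == Result.NO) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch a record pair in which only one record has a DOI is not rejected on the DOI comparison. The decision is left to the remaining fields.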
The EndNote fields used in the comparisons are:
| EndNote field | Content | Treated as | Used in comparison no. |
|---|---|---|---|
| PY | Publication year | Year | 1 |
| SP | Pages | Starting page | 2 |
| C7 | Article number | Starting page | 2 |
| DO | DOI | DOI | 2 |
| AU | Authors | Authors | 3 |
| TI | Title | Title | 4 |
| OP * | Original title (when reference type is not Conference Proceedings) | Title | 4 |
| ST | Short title | Title | 4 |
| T3 ** | Conference title / Alternate journal title / Original title | Title | 4 |
| SN | ISSN or ISBN | ISBN / ISSN | 5 |
| T2 | Journal title / Book title | Journal | 5 |
| J2 | Alternate journal | Journal | 5 |
| OP * | Conference title (when reference type is Conference Proceedings) | Journal | 5 |
| T3 ** | Conference title / Alternate journal title / Original title | Journal | 5 |
| VL *** | Can be last part of journal title (T2) | Journal | 5 |
*: the field OP can be a title variant or a journal variant.
**: Conference titles in T3 are omitted.
***: VL can be the last part of the T2 field ("T2 - American journal of physiology" and "VL - Regulatory, integrative and comparative physiology. 303").
Justification:
In the test database of 52,000 records, 11% of the records were of non-English origin.
Justification:
DedupEndNote uses Jaro-Winkler Similarity instead of equality when comparing some fields (Authors, Titles and, in part, Journals).
Justification:
Why Jaro-Winkler Similarity (JWS) and not Levenshtein distance / ...?
Thresholds used for Jaro-Winkler Similarity:
| Field | Case | Threshold | Explanation |
|---|---|---|---|
| Authors | default (i.e. not a Reply) | 0.67 | |
| Authors | Reply and sufficient Start Pages | 0.75 | When a Reply the titles are not compared |
| Authors | Reply and insufficient Start Pages | 0.8 | When a Reply the titles are not compared |
| Journals * | default (i.e. not a Reply) | 0.9 | |
| Journals * | Reply | 0.93 | |
| Title | Sufficient Start Pages or DOIs | 0.89 | |
| Title | Insufficient Start Pages and DOIs | 0.94 | |
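The threshold table can be read as a small lookup. The following sketch only encodes the values listed above. The class and method names are hypothetical, and whether a score must be strictly greater than the threshold or may equal it is not stated in the table, so that check is deliberately left out.

```java
final class SimilarityThresholds {

    /** Authors: stricter thresholds for Replies, because their titles are not compared. */
    static double authors(boolean isReply, boolean sufficientStartPages) {
        if (!isReply) return 0.67;
        return sufficientStartPages ? 0.75 : 0.8;
    }

    /** Journals: 0.9 by default, 0.93 for Replies. */
    static double journals(boolean isReply) {
        return isReply ? 0.93 : 0.9;
    }

    /** Titles: 0.89 when starting pages or DOIs are available, else 0.94. */
    static double title(boolean sufficientStartPagesOrDois) {
        return sufficientStartPagesOrDois ? 0.89 : 0.94;
    }
}
```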
Records are put in year sets based on the publication year. Records without publication year are put in a special year set YEAR_0.
When deduplicating 1 file, records are compared in descending order of pairs of year sets. The records in YEAR_0 are added to each of these pairs except for the ones which are already marked as duplicates. For an EndNote RIS file with records from 1889 to 2020:
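As an illustration only, the descending pairs of year sets could be generated as in the sketch below. It assumes that each pair consists of a year and the year before it (matching the 1 year margin of the first comparison). The names are hypothetical, and adding the YEAR_0 records to each pair is not modelled.

```java
import java.util.ArrayList;
import java.util.List;

final class YearSetPairs {

    /**
     * Returns the pairs {year, year - 1} from maxYear down to minYear + 1.
     * Assumption: one pair per year, combined with the preceding year.
     */
    static List<int[]> descendingPairs(int minYear, int maxYear) {
        List<int[]> pairs = new ArrayList<>();
        for (int year = maxYear; year > minYear; year--) {
            pairs.add(new int[] { year, year - 1 });
        }
        return pairs;
    }
}
```

Under these assumptions `descendingPairs(1889, 2020)` starts with the pair 2020 and 2019 and ends with the pair 1890 and 1889.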
Justification:
When deduplicating 2 files, records of both files are compared in ascending order of pairs of year sets. The YEAR_0 is added to each of these pairs, but records which are marked as duplicates are first removed.
The records of the OLD file are read before the records of the NEW file; because the duplicate chosen in a set of duplicate records is the first one encountered, duplicate records from the OLD file will be chosen when present.
For an EndNote RIS file with records from 1889 to 2020:
The output file only contains records from the NEW file which are not duplicates of records of the OLD file, and (if there are duplicates within the NEW file) are the first duplicate encountered within that duplicate set.
Justification:
The 5th comparison (ISSN or Journal: Are they the same (ISSN) or similar (Journal)?) seems at first sight to compare only journal articles (with the side effect that publications of another type could never be duplicates). This is not completely true:
However: the general rule treats 2 records / field sets as different if both records have insufficient data for that comparison. Two book records with the same authors, publication year and book title will be considered duplicates only if both have the same / a similar ISBN.
Relaxing this general rule (so that comparisons with insufficient data in one or both records are treated the same way: UNKNOWN, so go on to the next comparison) would result in a lot more False Positives.
Justification:
Data are from:
| Name | Tool | True pos | False neg | Sensitivity | True neg | False pos | Specificity | Accuracy |
|---|---|---|---|---|---|---|---|---|
| SRA: Cytology screening (1856 rec) | EndNote X9 | 885 | 518 | 63.1% | 452 | 1 | 99.8% | 72.0% |
| SRA: Cytology screening (1856 rec) | SRA-DM | 1265 | 139 | 90.1% | 452 | 0 | 100.0% | 92.5% |
| SRA: Cytology screening (1856 rec) | DedupEndNote | 1361 | 59 | 95.8% | 436 | 0 | 100.0% | 96.8% |
| SRA: Haematology (1415 rec) | EndNote | 159 | 87 | 64.6% | 1165 | 4 | 99.7% | 93.6% |
| SRA: Haematology (1415 rec) | SRA-DM | 208 | 38 | 84.6% | 1169 | 0 | 100.0% | 97.3% |
| SRA: Haematology (1415 rec) | DedupEndNote | 222 | 6 | 97.3% | 1186 | 1 | 99.9% | 99.5% |
| SRA: Respiratory (1988 rec) | EndNote X9 | 410 | 391 | 51.2% | 1185 | 2 | 99.8% | 80.2% |
| SRA: Respiratory (1988 rec) | SRA-DM | 674 | 125 | 84.4% | 1189 | 0 | 100.0% | 93.7% |
| SRA: Respiratory (1988 rec) | DedupEndNote | 768 | 18 | 97.7% | 1202 | 0 | 100.0% | 99.0% |
| SRA: Stroke (1292 rec) | EndNote X9 | 372 | 134 | 73.5% | 784 | 2 | 99.7% | 89.5% |
| SRA: Stroke (1292 rec) | SRA-DM | 426 | 81 | 84.0% | 785 | 0 | 100.0% | 93.7% |
| SRA: Stroke (1292 rec) | DedupEndNote | 497 | 8 | 98.4% | 787 | 0 | 100.0% | 99.4% |
| McKeown (3130 rec) | OVID | 1982 | 90 | 95.7% | 1058 | 0 | 100.0% | 97.1% |
| McKeown (3130 rec) | EndNote | 1541 | 531 | 74.4% | 850 | 208 | 80.3% | 76.4% |
| McKeown (3130 rec) | Mendeley | 1877 | 195 | 90.6% | 1041 | 17 | 98.4% | 93.2% |
| McKeown (3130 rec) | Zotero | 1473 | 599 | 71.1% | 1038 | 20 | 98.1% | 80.2% |
| McKeown (3130 rec) | Covidence | 1952 | 120 | 94.2% | 1056 | 2 | 99.8% | 96.1% |
| McKeown (3130 rec) | Rayyan | 2023 | 49 | 97.6% | 1006 | 52 | 95.1% | 96.8% |
| McKeown (3130 rec) | DedupEndNote | 2023 | 33 | 98.4% | 1074 | 0 | 100.0% | 98.9% |
| BIG_SET (5082 rec) | DedupEndNote | 3952 | 92 | 97.7% | 1030 | 8 | 99.2% | 98.0% |
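The percentage columns follow the usual definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), accuracy = (TP + TN) / all records. A minimal Java sketch (the class name `DedupMetrics` is only illustrative) that reproduces a row of the table:

```java
final class DedupMetrics {

    /** Share of true duplicates that were found, as a percentage. */
    static double sensitivity(int truePos, int falseNeg) {
        return 100.0 * truePos / (truePos + falseNeg);
    }

    /** Share of non-duplicates that were correctly left alone, as a percentage. */
    static double specificity(int trueNeg, int falsePos) {
        return 100.0 * trueNeg / (trueNeg + falsePos);
    }

    /** Share of all records that were classified correctly, as a percentage. */
    static double accuracy(int truePos, int falseNeg, int trueNeg, int falsePos) {
        int total = truePos + falseNeg + trueNeg + falsePos;
        return 100.0 * (truePos + trueNeg) / total;
    }
}
```

For the EndNote X9 row of the Cytology screening set (885, 518, 452, 1) this gives about 63.1%, 99.8% and 72.0%, in line with the table.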
See Test results - details for a description of this test database BIG_TEST on portal vein thrombosis.
| Tool | Setting | Duplicates found | Duplicates to delete | After deduplication | % kept |
|---|---|---|---|---|---|
| EndNote | Author + Year + Title + Reference Type (default setting) | 32,891 | 19,959 | 32,869 | 62% |
| EndNote | Author + Year + Title | 32,920 | 19,976 | 32,852 | 62% |
| EndNote | Author + Year + Title + Secondary Title (Journal) | 22,120 | 12,333 | 40,495 | 77% |
| DedupEndNote | | 38,617 | 24,357 | 28,471 | 54% |
The False Positives are all conference abstracts:
If you want to manually merge the records which are duplicates according to DedupEndNote, you can use Mark mode.
Because Zotero does not work with / does not show record numbers, Mark Mode is not relevant for Zotero users.
In Mark mode the ID of the first record of a set of duplicate records is copied to the label field ("LB") of all duplicate records. The input file is copied to the output file with the addition of the Label field if it is not empty. The original content of the Label field is overwritten! All other fields are copied as is (so no enriching of the output: prescribed form of DOI, ...).
After importing the results file into a new EndNote database, make the Label field visible. The IDs in the Label field refer to the IDs of the original EndNote database.
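The labelling step of Mark mode can be sketched as follows. This is not the tool's code: the names are hypothetical, duplicate sets are represented simply as lists of record IDs in input order, and the sketch assumes that the first record of a set also receives its own ID as label.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MarkModeLabels {

    /**
     * Maps every record ID in a duplicate set to the ID of the first
     * record of that set. Records without duplicates get no label.
     */
    static Map<Integer, Integer> labels(List<List<Integer>> duplicateSets) {
        Map<Integer, Integer> labelById = new LinkedHashMap<>();
        for (List<Integer> set : duplicateSets) {
            if (set.size() < 2) continue; // a single record is not a duplicate set
            int firstId = set.get(0);
            for (int id : set) {
                labelById.put(id, firstId);
            }
        }
        return labelById;
    }
}
```

Sorting the imported database on the Label field then brings the members of each duplicate set together.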
Comparing the results of deduplication by EndNote itself and DedupEndNote:
EndNote export files in RIS format do not contain grouping information!
If you have groupings, ... in your EndNote database before you started using DedupEndNote: use DedupEndNote in Mark mode, and then deduplicate within EndNote on the Label field only.
TODO: EXPAND
TODO: XML (formatting)
However, for new projects the following will work: using a MASTER Database and a WORKING Database
(The names of the EndNote databases and the RIS files are just examples)
When you have new export files from the same or other bibliographic databases:
TODO: What about manual adds?
TODO: What about updates of ahead-of-print publications?
See previous question
DedupEndNote compares records in the order of the input file, after they are grouped on publication year. This makes it possible to influence the choice of duplicate within a duplicate set.
Suppose all records have the field Database Provider filled in with either "PubMed" or "EMBASE". If you order the original EndNote database on the field Database Provider in descending order before exporting to a RIS file, DedupEndNote will use a PubMed record if a duplicate set contains one or more PubMed records and one or more EMBASE records (except if one or more EMBASE records of this duplicate set has a publication year one year later than all PubMed records). TODO: Explain why "later"?
Your preference will probably not be the same as the alphabetical order of the names of database providers ("CINAHL", "Cochrane ...", "EMBASE", ...). You could however change the content of this field to "1 - PubMed", "2 - EMBASE", "3 - Cochrane", ... and order your EndNote database on this field (EndNote will order numerically) before exporting to a RIS file, or use another field with similar content.
Of course not.
However, DedupEndNote tries at all costs to avoid wrongly identifying records as duplicates. Mark mode was specifically developed for this, making comparison between results of deduplication by EndNote and by DedupEndNote feasible.
There was only 1 false positive found in the validation sets (11,474 validated records):
DedupEndNote will miss some duplicate records, e.g.:
After deduplication by DedupEndNote, deduplication with EndNote itself (default setting: Author, Year, Title) could be worthwhile. The case of different starting pages can only be solved by looking up the publication itself, the case of the missed similar journals by your brain.
The next results are for version 1.0.0 / these results should be updated for the latest version:
After deduplicating our test database of 52,828 records with DedupEndNote and importing the result file into a new EndNote database, deduplication with EndNote itself (default setting: Author, Year, Title) found 260 duplicates. After manual inspection, we found 118 records which were true duplicates. Some of the "reasons":
2025-08-01: the change in version 1.0.2 (comparison 5 (ISSN and Journal names) is skipped when the DOIs are the same) solves some (a lot?) of these problems. But: the journal name used in the deduplicated file is the one from the first record in the duplicate set.
Some cases of duplicate records are "unsolvable" for programs? Take e.g. the following publication (https://www.nejm.org/doi/full/10.1056/NEJM199105303242207):
"Cabot RC, Scully RE, Mark EJ, McNeely WF, McNeely BU" in (1) were the (associate) editors of the NEJM in 1991, "Podolsky DK, Lewandrowski KB" in (2) described the case, all authors in (3) discussed the case.
A relational database was used for validation.
The RDBMS used was MS Access (MS Office 2016):
The table for a validation set contains the fields:
| Field | Type | Default | Content |
|---|---|---|---|
| id | INTEGER | | The original ID in the EndNote DB. PRIMARY KEY |
| dedupid | INTEGER | NULL | Content of the Label field in Mark mode, i.e. the ID of the first record in a duplicate set |
| correction | INTEGER | NULL | Manually set for the False Positive (FP) and False Negative (FN) results (see below) |
| validated | BOOLEAN | FALSE | Manually set to TRUE if the DedupEndNote result is validated |
| tp | BOOLEAN | FALSE | Manually set to TRUE if record is indeed a duplicate of the record with DedupID |
| tn | BOOLEAN | FALSE | Manually set to TRUE if record has no duplicates |
| fp | BOOLEAN | FALSE | Manually set to TRUE if DedupEndNote has wrongly identified the record as a duplicate of record with DedupID. If the record has no duplicates, Correction contains the ID, otherwise the ID of the true duplicate. |
| fn | BOOLEAN | FALSE | Manually set to TRUE if DedupEndNote has not identified the record as a duplicate. The ID of the missed duplicate is stored in Correction. If the record is a False Positive but also has duplicates, it is only marked as False Positive: otherwise TP + TN + FP + FN would be greater than the size of the validation set. |
| unsolvable | BOOLEAN | FALSE | ??? |
| authors_truncated | TEXT | | Authors joined with '; ', truncated at 254 characters. In an MS Access DB: SHORT TEXT (i.e. max. 255 characters), to make the field sortable and searchable. |
| authors | TEXT | | Authors joined with '; '. In an MS Access DB: LONG TEXT (a.k.a. MEMO), not sortable or searchable. |
| publ_year | TEXT | | Publication Year |
| title_truncated | TEXT | | Title, truncated at 254 characters. In an MS Access DB: SHORT TEXT (i.e. max. 255 characters), to make the field sortable and searchable. |
| title | TEXT | | Title. In an MS Access DB: LONG TEXT (a.k.a. MEMO), not sortable or searchable. |
| title2 | TEXT | | Journal Title / Book Title |
| volume | TEXT | | Volume |
| issue | TEXT | | Issue |
| pages | TEXT | | Starting Page |
| article_number | TEXT | | Article Number |
| dois | TEXT | | DOIs joined with '; ' |
| publ_type | TEXT | | Type of publication. 'type' is a SQL reserved word |
| database | TEXT | | Database Provider |
| number_authors | INTEGER | | Number of authors |
java -Dlogging.level.edu.dedupendnote.services.IOService=DEBUG -jar DedupEndNote-0.9.7b-SNAPSHOT.jar
If everything works, the log should end with "Publications read: ". If not, the log will end with a message with the Record ID and title of the last publication that was successfully read.
java -Dlogging.level.edu.dedupendnote.services.IOService=DEBUG -Dlogging.level.edu.dedupendnote.domain.Publication=DEBUG -jar DedupEndNote-0.9.7b-SNAPSHOT.jar
java -Dlogging.level.edu.dedupendnote.services=DEBUG -jar DedupEndNote-0.9.7b-SNAPSHOT.jar
Do not set the level to TRACE. The log file will be flooded, and the program may come to a halt.
The test file MissedDuplicatesTests.java uses the level TRACE.