DedupEndNote (version 1.0.1 20240114)

Steps

  1. Deduplicate one file:
    • export an EndNote database into a file in RIS format
    • upload this file in DedupEndNote
    • save the results file with deduplicated records
    • import this results file into a new EndNote database
  2. Deduplicate a new file against an existing file / EndNote database: see /twofiles

Why DedupEndNote?

Deduplication in EndNote misses many duplicate records. Building and maintaining a Journals List within EndNote can partly solve this problem, but there remain many cases where EndNote is too unforgiving when comparing records. Some bibliographic databases offer deduplication for their own databases (OVID: Medline and EMBASE), but this does not help PubMed, Cochrane or Web of Science users.

DedupEndNote deduplicates an EndNote RIS file and writes a new RIS file with the unique records, which can be imported into a new EndNote database. It is more forgiving than EndNote itself when comparing records, and tests have shown that it identifies many more duplicates (see below under "Performance").

The program has been tested on EndNote databases with records from:

  • CINAHL (EBSCOHost)
  • Cochrane Library (Trials)
  • EMBASE (OVID)
  • Medline (OVID)
  • PsycINFO (OVID)
  • PubMed
  • Scopus
  • Web of Science (very few tests with conference papers)

What does DedupEndNote do?

1. Deduplicate records in a RIS file

Each pair of records is compared in 5 different ways. The general rule is:

Comparison | Result | Action
1 ... 5 | YES (or insufficient data for comparison) | go to next comparison if present, else mark the records as duplicates
1 ... 5 | NO | stop comparisons for this pair of records

The following comparisons are used (in this order, chosen for performance reasons):

  1. Publication year: Are they at most 1 year apart?
    Insufficient data: Records without a publication year are compared to all records unless they have been identified as a duplicate.
    Special cases: Cochrane Reviews are compared for the same publication year
  2. Starting page or DOI: Are they the same?
    If the starting pages are different or one or both are absent, the DOIs are compared.
    Preprocessing: Article number is treated as a starting page if starting page itself is empty or contains "-".
    Preprocessing: Starting pages are compared only for number: "S123" and "123" are considered the same.
    Preprocessing: In DOIs 'http://dx.doi.org/', 'http://doi.org/', ... are left out.
    Insufficient data: If one or both DOIs are missing and one or both of the starting pages are missing, the answer is YES. This is important because of PubMed ahead of print publications.
    Special cases: For Cochrane Reviews DOIs are compared before starting pages.
  3. Authors: Is the Jaro-Winkler similarity of the authors > 0.67?
    Preprocessing: The author "Anonymous," and all Author Groups are skipped.
    Preprocessing: First names are reduced to initials ("Moorthy, Ranjith K." to "Moorthy, RK").
    Preprocessing: All authors from each record are joined by "; ".
    Insufficient data: If one or both records have no authors, the answer is YES (except if one of the records is a reply (see below) and one of the records has no starting page or DOI).
  4. Title: Is the Jaro-Winkler similarity of (one of) the normalized titles > 0.9?
    The fields Original publication (OP), Short Title (ST), Title (TI) and sometimes Book section (T3, see below) are treated as titles.
    Because the Jaro-Winkler similarity algorithm puts a heavy penalty on differences at the beginning of a string, the normalized titles are also reversed.
    Preprocessing: The titles are normalized (converted to lower case, text between "<...>" removed, all characters which are not letters or numbers are replaced by a space character, ...).
    Insufficient data: If one of the records is a reply (see below), the titles are not compared / the answer is YES (but the Jaro-Winkler similarity of the authors should be > 0.75 and the comparison between the journals is more strict).
  5. ISSN or Journal: Are they the same (ISSN) or similar (Journal)?
    The fields Journal / Book Title (T2), Alternate Journal (J2) and sometimes Book section (T3, see below) are treated as journals; ISBNs are treated as ISSNs. All ISSNs and journal titles (including abbreviations) in the records are used.
    If the ISSNs are different or one or both records have no ISSN, the journals are compared.
    Abbreviated and full journal titles are compared in a sensible way (see examples below).
    Preprocessing: ISSNs are normalized (dashes are removed, letters are lowercased). For ISBN-10 the first 9 digits are used, for ISBN-13 the 9 digits starting at position 4.
    Preprocessing: Journal titles of the form "Zhonghua wai ke za zhi [Chinese journal of surgery]" or "Zhonghua wei chang wai ke za zhi = Chinese journal of gastrointestinal surgery" or "The Canadian Journal of Neurological Sciences / Le Journal Canadien Des Sciences Neurologiques" are split into 2 journal titles.
    Preprocessing: the journal titles are normalized (hyphens, dots and apostrophes are replaced with space, end part between round or square brackets is removed, initial article is removed, ...).
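
The preprocessing rules listed above can be illustrated with a small Python sketch. This is not DedupEndNote's own code: the function names are hypothetical, Author Groups are not handled, and lowercasing the DOI and stripping spaces from ISSNs/ISBNs are assumptions added for the example.

```python
import re


def first_page_number(start_page: str) -> str:
    """Keep only the first run of digits, so "S123" and "123" compare equal."""
    match = re.search(r"\d+", start_page)
    return match.group() if match else ""


def strip_doi_prefix(doi: str) -> str:
    """Drop resolver prefixes such as 'http://dx.doi.org/' (lowercasing is an assumption)."""
    return re.sub(r"^https?://(dx\.)?doi\.org/", "", doi.strip(), flags=re.IGNORECASE).lower()


def abbreviate_author(author: str) -> str:
    """Reduce first names to initials: "Moorthy, Ranjith K." becomes "Moorthy, RK"."""
    last, _, first = author.partition(",")
    initials = "".join(part[0].upper() for part in re.split(r"[\s.\-]+", first) if part)
    return f"{last.strip()}, {initials}" if initials else last.strip()


def join_authors(authors: list) -> str:
    """Skip "Anonymous," and join the remaining abbreviated authors with "; "."""
    kept = [a for a in authors if a.strip().rstrip(",").lower() != "anonymous"]
    return "; ".join(abbreviate_author(a) for a in kept)


def normalize_issn_or_isbn(value: str) -> str:
    """Remove dashes and lowercase; reduce ISBN-10 and ISBN-13 to their shared 9 digits."""
    compact = value.replace("-", "").replace(" ", "").lower()
    if len(compact) == 10:      # ISBN-10: the first 9 digits
        return compact[:9]
    if len(compact) == 13:      # ISBN-13: the 9 digits starting at position 4
        return compact[3:12]
    return compact              # ISSN (8 characters) or anything else
```

With these helpers, the ISBN pair mentioned under "Latest changes" (9024274214 and 9789024274215) normalizes to the same 9 digits.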

T3 field: Especially EMBASE (OVID) uses this field for (1) Conference title (majority of cases), (2) an alternative journal title, and (3) original (non English) title. Case 1 (identified as containing a number or "Annual", "Conference", "Congress", "Meeting" or "Society") is skipped. All other T3 fields are treated as Journals and as titles.
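
As an illustration of the T3 rule, the following Python sketch flags a T3 value as a conference title. The function name is hypothetical, and matching the keywords case-insensitively as substrings is an assumption; the text above only names the keywords.

```python
import re

# Keywords taken from the rule above.
CONFERENCE_WORDS = ("Annual", "Conference", "Congress", "Meeting", "Society")


def is_conference_title(t3: str) -> bool:
    """Return True if the T3 field looks like a conference title (case 1)."""
    if re.search(r"\d", t3):
        return True
    lowered = t3.lower()
    return any(word.lower() in lowered for word in CONFERENCE_WORDS)
```

A T3 value for which this returns False would be treated as both a journal and a title.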

Reply: a publication is considered a reply if the title (field TI) contains "reply", or contains "author(...)respon(...)", or is nothing but "response" (all case insensitive).
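
The reply rule can be written as a single regular expression. The sketch below is one possible reading of "author(...)respon(...)", namely "author" followed anywhere later in the title by "respon"; the pattern and function name are assumptions, not the tool's actual code.

```python
import re

# "reply" anywhere, or "author ... respon ...", or a title that is only "response".
REPLY_PATTERN = re.compile(r"reply|author.*respon|^\s*response\s*$", re.IGNORECASE)


def is_reply(title: str) -> bool:
    """Return True if the TI field marks the publication as a reply."""
    return REPLY_PATTERN.search(title) is not None
```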

If two records get 5 YES answers, they are considered duplicates.
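
The cascade of comparisons can be sketched as a chain that stops at the first NO. The Python example below implements only the first two comparisons on hypothetical dictionary records (the field names "year", "page" and "doi" and the DOI value are made up for the example); the author, title and journal comparisons would be added to the list in the same way.

```python
from typing import Callable, List

Record = dict  # hypothetical record: field name -> value, or None when missing
Comparison = Callable[[Record, Record], bool]


def years_close(a: Record, b: Record) -> bool:
    """Comparison 1: at most 1 year apart; a missing year counts as YES."""
    year_a, year_b = a.get("year"), b.get("year")
    if year_a is None or year_b is None:
        return True
    return abs(year_a - year_b) <= 1


def same_page_or_doi(a: Record, b: Record) -> bool:
    """Comparison 2: same starting page, else same DOI; insufficient data counts as YES."""
    page_a, page_b = a.get("page"), b.get("page")
    doi_a, doi_b = a.get("doi"), b.get("doi")
    if page_a and page_b and page_a == page_b:
        return True
    if doi_a and doi_b:
        return doi_a == doi_b
    # A DOI is missing: YES only if a starting page is missing as well.
    return not (page_a and page_b)


def is_duplicate(a: Record, b: Record, comparisons: List[Comparison]) -> bool:
    """all() stops at the first comparison that answers NO."""
    return all(compare(a, b) for compare in comparisons)


COMPARISONS = [years_close, same_page_or_doi]

first = {"year": 2016, "page": "123", "doi": None}
second = {"year": 2017, "page": "123", "doi": "10.1000/example"}
third = {"year": 2019, "page": "123", "doi": None}
print(is_duplicate(first, second, COMPARISONS))  # True
print(is_duplicate(first, third, COMPARISONS))   # False
```

Putting the cheapest comparison first means most pairs are rejected before the more expensive string similarity comparisons run, which matches the performance reason given for the order.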

2. Enrich the deduplicated records

Only the first record of a set of duplicate records is copied to the output file.

When writing the output file, the following fields will be changed:

  • Author (AU):
    • if the (only) author is "Anonymous", the author is omitted
  • DOI (DO):
    • the DOIs of the removed duplicate records are copied to the saved record and deduplicated. The DOI field is important for finding the full text in EndNote.
    • DOIs of the form "10.1038/ctg.2014.12", "http://dx.doi.org/10.1038/ctg.2014.12", ... are rewritten in the prescribed form "https://doi.org/10.1038/ctg.2014.12". DOIs of this form are clickable links in EndNote.
  • Publication year (PY):
    • if the saved record has no value for its Publication year but one of the removed duplicate records has, the first non-empty Publication year of the duplicates is copied to the saved record.
  • Starting page (SP) and Article Number (C7):
    • the article number is put in the Pages field (SP) if the Pages field is empty or does not contain a "-", overwriting the Pages field content.
    • the article number field (C7) is omitted
    • if the saved record has no value for its Pages field (e.g. PubMed ahead of print publications) but one of the removed duplicate records has, the first non-empty pages of the duplicates are copied to the saved record.
    • the Pages field gets an unabbreviated form: e.g. "482-91" is rewritten as "482-491".
    • if the ending page is the same as the starting page, only the starting page is written ("192" instead of "192-192").
    • for Cochrane Reviews a missing review number ("CD...") is extracted from the DOI.
  • Title (TI):
    • If the publication is a reply, the title is replaced with the longest title from the duplicates (e.g. "Reply from the authors" is replaced by "Coagulation parameters and portal vein thrombosis in cirrhosis Reply")
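
Two of the rewriting rules above, the prescribed DOI form and the unabbreviated page range, can be sketched in Python as follows. The function names are hypothetical and the code covers only the simple cases named in the list.

```python
import re


def canonical_doi(doi: str) -> str:
    """Rewrite a bare or prefixed DOI to the "https://doi.org/..." form."""
    bare = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi.strip(), flags=re.IGNORECASE)
    return "https://doi.org/" + bare


def expand_pages(pages: str) -> str:
    """Expand "482-91" to "482-491" and collapse "192-192" to "192"."""
    start, separator, end = pages.partition("-")
    if not separator or not end:
        return start
    if end.isdigit() and len(end) < len(start):
        # Borrow the missing leading digits from the starting page.
        end = start[: len(start) - len(end)] + end
    return start if end == start else f"{start}-{end}"
```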

The output file is a new RIS file which can be imported into a new EndNote database.

DedupEndNote is slower than EndNote in deduplicating records because its comparisons are more time-consuming. EndNote can deduplicate an EndNote database of ca. 15,000 records in less than 5 seconds. DedupEndNote needs around 20 seconds to deduplicate the export file in RIS format (115MB).

DedupEndNote has borrowed several ideas from: Yu Jiang, Can Lin, Weiyi Meng, Clement Yu, Aaron M. Cohen and Neil R. Smalheiser: Rule-based deduplication of article records from bibliographic databases, in: Database 2014, ID bat086, doi:10.1093/database/bat086

Mark mode

If you want to manually merge the records which are duplicates according to DedupEndNote, you can use Mark mode.

In Mark mode the ID of the first record of a set of duplicate records is copied to the label field ("LB") of all duplicate records. The input file is copied to the output file with the addition of the Label field if it is not empty. The original content of the Label field is overwritten! All other fields are copied as is (so no enriching of the output: prescribed form of DOI, ...).
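
The labelling step of Mark mode can be pictured with the Python sketch below. It assumes the duplicate sets are already known as lists of record IDs with the first record in front, and that the first record receives its own ID as label as well; both the data layout and the function names are hypothetical. The "LB  - " line follows the usual RIS layout of a two-character tag, two spaces, a hyphen and a space.

```python
from typing import Dict, List


def mark_labels(duplicate_sets: List[List[int]]) -> Dict[int, str]:
    """Map every record ID in a duplicate set to the ID of the set's first record."""
    labels: Dict[int, str] = {}
    for record_ids in duplicate_sets:
        for record_id in record_ids:
            labels[record_id] = str(record_ids[0])
    return labels


def label_line(label: str) -> str:
    """Format the Label field as a RIS line."""
    return f"LB  - {label}"


labels = mark_labels([[12, 40, 97], [15, 16]])
print(label_line(labels[97]))  # LB  - 12
```

Records that are in no duplicate set get no entry, so their Label field stays absent from the output.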

After importing the results file into a new EndNote database, make the Label field visible. The IDs in the Label field refer to the IDs of the original EndNote database.

  • To see the duplicates, search for "Label" "is greater than" 0, and sort on the label field.
  • To manually merge, change in Preferences / Deduplicate the fields to deduplicate on to the Label field, deduplicate (in EndNote) and merge at will

Comparing the results of deduplication by EndNote itself and DedupEndNote:

  • Import the result file into a new EndNote database
  • Deduplicate by EndNote itself
  • Select all records in the "Duplicate References" set, and mark them as Read
  • Select the "All References" set, limit to the duplicates found by DedupEndNote (search for "Label" "is greater than" 0), and sort on the label field.
  • The records marked as Unread were identified as duplicates by DedupEndNote, but not by EndNote itself.
  • Select the "All References" set, limit to the duplicates not found by DedupEndNote (search for "Label" "is less than" 0).
  • The records marked as Read were identified as duplicates by EndNote itself, but not by DedupEndNote.

Examples of comparison

In the following table the results of EndNote's Find duplicates are compared to the comparisons in DedupEndNote. For these tests only one field was selected in EndNote in "Edit > Preferences > Duplicates", and "Ignore spacing and punctuation" was selected.

Columns: Field; Examples; EndNote finds duplicates; DedupEndNote score

Starting page and article number
  • ...
  • ...
EndNote finds duplicates: ???; DedupEndNote score: ???

Title
  • 90Y radioembolization using resin microspheres in patients with hepatocellular carcinoma and portal vein thrombosis
  • 90Y RADIOEMBOLIZATION USING RESIN MICROSPHERES IN PATIENTS WITH HEPATOCELLULAR CARCINOMA AND PORTAL VEIN THROMBOSIS
EndNote finds duplicates: Yes; DedupEndNote score: 1.00 == Yes

Title
  • Comments about Glisson's capsule phleboliths and portal vein thrombosis [1]
  • COMMENTS ABOUT GLISSON CAPSULE PHLEBOLITHS AND PORTAL-VEIN THROMBOSIS
EndNote finds duplicates: No; DedupEndNote score: 0.92 == Yes

Title
  • Transarterial chemoembolization and <sup>90</sup>y radioembolization for hepatocellular carcinoma: Review of current applications beyond intermediate-stage disease
  • Transarterial Chemoembolization and Y-90 Radioembolization for Hepatocellular Carcinoma: Review of Current Applications Beyond Intermediate-Stage Disease
EndNote finds duplicates: No; DedupEndNote score: 0.92 == Yes

Title
  • Epidemiology and diagnosis profile of digestive cancer in teaching hospital campus of lome: About 250 cases. [French]
  • Epidemiology and diagnosis profile of digestive cancer in teaching Hospital Campus of Lome: about 250 cases
EndNote finds duplicates: No; DedupEndNote score: 0.99 == Yes

Title
  • Post Splenectomy Outcome in beta-Thalassemia
  • Post Splenectomy Outcome in β-Thalassemia
EndNote finds duplicates: No; DedupEndNote score: 0.96 == Yes

Title
  • Letter: portal vein obstruction--which subset of patients could benefit the most? Authors' reply
  • Letter: Portal vein obstruction - Which subset of patients could benefit the most?
EndNote finds duplicates: No; DedupEndNote score: 0.97 == Yes *

Title
  • Title: Some diseases associated with ulcero- hemorrhagic colitis: complication or coincidence. [French]
    Original Title: Quelques maladies associees a la colite ulcero- Hemorragique: Complications ou coincidences
  • Title: [Various diseases associated with ulcero-hemorrhagic colitis: complications or coincidences]
    Original Title: Quelques maladies associees a la colite ulcero-hemorragique: complications ou coincidences.
EndNote finds duplicates: No; DedupEndNote score: 1.00 == Yes *

Title
  • Title: [HELLP in the second trimester in a patient with antiphospholipid syndrome]
    Original Title: HELLP kan ses i andet trimester ved antifosfolipidsyndrom.
  • Title: HELLP kan ses i andet trimester ved antifosfolipidsyndrom
EndNote finds duplicates: No; DedupEndNote score: 1.00 == Yes *

Title
  • Title: NFkappaB inhibition decreases hepatocyte proliferation but does not alter apoptosis in obstructive jaundice
    Reversed Title: ecidnuaj evitcurtsbo ni sisotpopa retla ton seod tub noitarefilorp etycotapeh sesaerced noitibihni BappakFN
  • Title: NF kappa B inhibition decreases hepatocyte proliferation but does not alter apoptosis in obstructive jaundice
    Reversed title: ecidnuaj evitcurtsbo ni sisotpopa retla ton seod tub noitarefilorp etycotapeh sesaerced noitibihni B appak FN
EndNote finds duplicates: No; DedupEndNote score: 1.00 == Yes *

Title
  • Title: Case report. Duplication of the portal vein: a rare congenital anomaly
    Reversed Title: ylamona latinegnoc erar a :niev latrop eht fo noitacilpuD .troper esaC
  • Title: Duplication of the portal vein - A rare congenital anomaly
    Reversed title: ylamona latinegnoc erar A - niev latrop eht fo noitacilpuD
EndNote finds duplicates: No; DedupEndNote score: 0.96 == Yes *

Title
  • Title: La sémantique de l'image radiologique. Intérêt du procédé de soustraction électronique en couleurs d'Oosterkamp en angiographie abdominale
    Reversed Title: elanimodba eihpargoigna ne pmakretsoO'd srueluoc ne euqinortcelé noitcartsuos ed édécorp ud têrétnI .euqigoloidar egami'l ed euqitnamés aL
  • Title: INTERET DU PROCEDE DE SOUSTRACTION ELECTRONIQUE EN COULEURS D'OOSTERKAMP EN ANGIOGRAPHIE ABDOMINALE
    Reversed title: ELANIMODBA EIHPARGOIGNA NE PMAKRETSOO'D SRUELUOC NE EUQINORTCELE NOITCARTSUOS ED EDECORP UD TERETNI
EndNote finds duplicates: No; DedupEndNote score: 0.91 == Yes *

Authors
  • Cobos Mateos, J. M.; Aguinaga Manzanos, M. V.; Casas Pinillos, M. S.; Gonzalez Conde, R.; Gonzalez Sanchez, J. A.; De Miguel Velasco, J. E.; Soleto Saez, E.; Suarez Mier, M. P.
  • Mateos, J. M. C.; Manzanos, M. V. A.; Pinillos, M. S. C.; Conde, R. G.; Sanchez, J. A. G.; Velasco, J. E. D.; Saez, E. S.; Mier, M. P. S.
EndNote finds duplicates: No; DedupEndNote score: 0.75 == Yes

Authors
  • Danilă, M.; Sporea, I.; Popescu, A.; şirli, R.
  • Danila, M.; Sporea, I.; Popescu, A.; Sirli, R.
EndNote finds duplicates: No; DedupEndNote score: 0.93 == Yes

Authors
  • Lv, Y.; Qi, X.; Xia, J.; Fan, D.; Han, G.
  • Lv, Y.; Qi, X. S.; Xia, J. L.; Fan, D. M.; Han, G. H.
EndNote finds duplicates: No; DedupEndNote score: 0.90 == Yes

Authors
  • [empty]
  • Anonymous,
EndNote finds duplicates: No; DedupEndNote score: Yes

Journal
  • British journal of surgery
  • Br J Surg
EndNote finds duplicates: No; DedupEndNote score: Similar == Yes

Journal
  • European Journal of Gastroenterology and Hepatology
  • European Journal of Gastroenterology & Hepatology
EndNote finds duplicates: No; DedupEndNote score: Similar == Yes

Journal + ISSN
  • Japanese Journal of Cancer and Chemotherapy [ISSN: 2690-2692]
  • Gan To Kagaku Ryoho [ISSN: 2690-2692]
EndNote finds duplicates: No; DedupEndNote score: Similar == Yes

Journal
  • JAMA
  • Journal of the American Medical Association
EndNote finds duplicates: No; DedupEndNote score: Similar == Yes

Journal
  • The Lancet Haematology
  • Lancet Haematol
EndNote finds duplicates: No; DedupEndNote score: Similar == Yes

Journal
  • Hepatology
  • Hepatology International
EndNote finds duplicates: No; DedupEndNote score: Similar == Yes *

Journal
  • AJR Am J Roentgenol
  • American Journal of Roentgenology
EndNote finds duplicates: No; DedupEndNote score: Similar == Yes

Journal
  • British journal of surgery
  • Surgery
EndNote finds duplicates: No; DedupEndNote score: NOT similar == No

*: In these cases the comparison of DedupEndNote for this content for this field is not accurate. However, the comparison of the other fields for these records does not result in YES answers, so the records are ultimately not considered duplicates.
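
To make the title scores more concrete, here is a self-contained Python sketch of the textbook Jaro-Winkler similarity (prefix weight 0.1, at most 4 prefix characters), applied to a normalized title and to its reverse. It is a sketch under assumptions: Jaro-Winkler variants differ (some apply the prefix bonus only above a threshold), the normalization covers only the steps named earlier, and taking the maximum of the forward and reversed scores is one plausible way to combine them. The scores it produces need not match the table exactly.

```python
import re


def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity with a bonus for a common prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def normalize_title(title: str) -> str:
    """Lowercase, drop "<...>" markup, replace non-alphanumerics with spaces."""
    without_tags = re.sub(r"<[^>]*>", "", title.lower())
    spaced = "".join(ch if ch.isalnum() else " " for ch in without_tags)
    return " ".join(spaced.split())  # collapsing whitespace is an assumption


def title_similarity(title1: str, title2: str) -> float:
    """Best of the forward and the reversed Jaro-Winkler score."""
    n1, n2 = normalize_title(title1), normalize_title(title2)
    return max(jaro_winkler(n1, n2), jaro_winkler(n1[::-1], n2[::-1]))
```

Reversing the strings moves a differing start such as "Case report." to the end, where Jaro-Winkler penalizes it less.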

Performance

See here for comparison with other tools and published test sets.

Own test database: portal vein thrombosis (52,828 records)

See Test results - details for a description of this test database.

Tool | Setting | Duplicates found | Duplicates to delete | After deduplication | % kept
EndNote | Author + Year + Title + Reference Type (default setting) | 32,891 | 19,959 | 32,869 | 62%
EndNote | Author + Year + Title | 32,920 | 19,976 | 32,852 | 62%
EndNote | Author + Year + Title + Secondary Title (Journal) | 22,120 | 12,333 | 40,495 | 77%
DedupEndNote | | 38,407 | 24,201 | 28,627 | 54%

Validation

A general subset of 4923 records was manually checked (150 records without an Author, 79 with Author "Anonymous", the 366 articles with "phase" in the title (because of "Phase I", "Phase I/II", ...), the rest: "Aagaard" ... "Abbitt", "Anoob" ... "Axelrod", "Liu", "v Koppenfels" ... "von Woellwarth", and some others). Because replies are treated in a special way, 130 records with "reply" in the title which were identified as duplicates were checked, but only for true and false positives.

Set | True positive (correctly identified duplicates) | False negative (missed duplicates) | Sensitivity | True negative (correctly identified unique records) | False positive (incorrectly identified as duplicates) | Specificity
General | 3681 | 275 | 93.0% | 966 | 1 | 99.9%
Replies | 130 | | | | 0 |
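
The sensitivity and specificity of the General set follow from the four counts with the standard definitions. A short Python check (function names are only for this example):

```python
def sensitivity(true_positive: int, false_negative: int) -> float:
    """Share of real duplicates that were identified as duplicates."""
    return true_positive / (true_positive + false_negative)


def specificity(true_negative: int, false_positive: int) -> float:
    """Share of unique records that were correctly left alone."""
    return true_negative / (true_negative + false_positive)


# Counts of the General set from the validation table.
tp, fn, tn, fp = 3681, 275, 966, 1
print(tp + fn + tn + fp)                       # 4923 records checked
print(round(100 * sensitivity(tp, fn), 1))     # 93.0
print(round(100 * specificity(tn, fp), 1))     # 99.9
```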

The False Positives are all conference abstracts:

  • Cool, J.; Rosenblatt, R.; Kumar, S.; Lucero, C.; Fortune, B.; Crawford, C. V.; Jesudian, A.: THE ASSOCIATION BETWEEN PORTAL VEIN THROMBOSIS AND OTHER VENOUS THROMBOEMBOLISM IN CIRRHOSIS: ANALYSIS OF A NATIONALLY REPRESENTATIVE INPATIENT COHORT
    In: Gastroenterology 154 (6), 2018, p. S1178-S1179
  • Cool, J.; Rosenblatt, R.; Kumar, S.; Lucero, C.; Fortune, B.; Crawford, C. V.; Jesudian, A.: TRENDS IN THE PREVALENCE OF PORTAL VEIN THROMBOSIS AND ASSOCIATED MORTALITY IN CIRRHOSIS: ANALYSIS OF A NATIONALLY REPRESENTATIVE INPATIENT COHORT
    In: Gastroenterology 154 (6 Supplement 1), 2018, p. S-1178 DOI: 10.1016/s0016-5085(18)33901-5

Limitations

  • Input file size: The maximum size of the input file is limited to 150MB.
  • Input file format: The program only handles files in RIS format, not in XML or CSV format.
  • Input file encoding: The program assumes that the input file is encoded as UTF-8.
  • If authors AND (all) titles AND (all) journal names for a record use a non-Latin script, results for this record may be inaccurate.
  • (when deduplicating one file:) The input file must be an export from ONE EndNote database: the ID fields are used internally for identifying the records, so they have to be unique. However, if the RIS file does not have an ID field in the first publication, DedupEndNote assumes the whole file has no ID fields and gives every publication an ID (starting with 1). This has only been tested on Zotero files!
  • The program has been developed and tested for biomedical databases (PubMed, EMBASE, ...) and some general databases (Web of Science, Scopus). The data sets used were the results of biomedical queries. Deduplicating records from other databases is not guaranteed to work, and performance is often very poor, esp. for non journal articles (see Justification. 2.5).
  • The program uses a bibliographic point of view:
    • an article or conference abstract that has been published in more than one (issue of a) journal is not considered a duplicate publication.
    • Records for each publication year are compared to records from the same and the following year: a record from 2016 is compared to the records from 2015 (when treating the records from 2015) and from 2016 and 2017 (when treating the records from 2016). A PubMed ahead-of-print record from 2013 and a corresponding record from 2017 (when it was 'officially' published) will not be compared (and possibly deduplicated).
    • Bibliographic databases are not always very accurate in the starting page of a publication. Because starting page is part of the comparisons, DedupEndNote misses the duplicates when bibliographic databases don't agree on the starting page (and one or both records have no DOIs).
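
The year window described in the limitations can be sketched as a generator of candidate pairs. The Python example below is a simplification under assumptions: records are represented only by a hypothetical ID-to-year mapping, and records without a year are paired with every other record (the rule that already identified duplicates are skipped is left out).

```python
from collections import defaultdict
from itertools import combinations, product
from typing import Dict, List, Optional, Tuple


def candidate_pairs(years: Dict[int, Optional[int]]) -> List[Tuple[int, int]]:
    """Pair each record with records of the same and of the following year."""
    by_year = defaultdict(list)
    for record_id, year in years.items():
        by_year[year].append(record_id)
    no_year = by_year.pop(None, [])

    pairs: List[Tuple[int, int]] = []
    for year in sorted(by_year):
        pairs.extend(combinations(by_year[year], 2))                    # same year
        pairs.extend(product(by_year[year], by_year.get(year + 1, [])))  # next year
    # Records without a publication year are compared to all records.
    with_year = [rid for year in sorted(by_year) for rid in by_year[year]]
    pairs.extend(product(no_year, with_year))
    pairs.extend(combinations(no_year, 2))
    return pairs


pairs = candidate_pairs({1: 2013, 2: 2016, 3: 2017, 4: 2016, 5: None})
print(pairs)
```

In this example the 2013 record (ID 1) is never paired with the 2017 record (ID 3), which is the ahead-of-print case described above.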

Issues and feature requests

If you have any questions about the tool or come across a problem when trying to use it, please raise an issue on the GitHub Repository.

How to cite

If you use this software, please cite it as:
Lobbestael, G. (2023). DedupEndNote (Version 1.0.0) [Computer software]. https://github.com/globbestael/DedupEndNote

Latest changes

  • 2024-01-14: 1.0.1: BUG: Doesn't handle author last name starting with a hyphen
  • 2023-09-02: ISBN-10 and ISBN-13 are compared correctly: 9024274214 = 9789024274215
  • 2023-09-02: BUG: Publication years of form "1-1-1989" (instead of "1989") are parsed as years
  • 2023-08-21: If a title with ": " has at least 50 characters before the ": ", that first part is also used as a title variant
  • 2023-08-21: BUG: Result file for deduplication of two files in Mark mode was not always written
  • 2023-08-18: "How to cite" added (CITATION.cff)
  • 2023-08-12: Files without IDs (record numbers) for each record (e.g. Zotero), get an ID assigned by DedupEndNote
  • 2023-08-12: Deduplication of Conference proceedings from Web of Science was very poor (too many false positives). This problem has been solved