DedupEndNote: FAQ

  1. Why only EndNote export files?
  2. I don't use EndNote, but ...
  3. I use groups, comments, ... in EndNote and don't want to lose them
  4. I have deleted records in EndNote and don't want them to reappear when updating
  5. I prefer PubMed records above EMBASE records
  6. Is DedupEndNote perfect?

1. Why only EndNote export files?

EndNote is very good at importing export files from many bibliographic databases. The export from an EndNote database has a standard format, which makes reading and interpreting records for DedupEndNote a lot easier.
(This version of) DedupEndNote does not read EndNote XML exports, because reading and writing XML files is more complex than reading and writing RIS files and offers no benefit.
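To illustrate why the RIS format is easy to read, here is a minimal sketch of a RIS reader in Python. It is not DedupEndNote's own code: the function name `read_ris` and the sample records are made up for this example, and it only assumes the usual RIS layout (a two-character tag, two spaces, a hyphen, a space, the value, with `ER` closing each record).

```python
import re

# A RIS line looks like "TY  - JOUR": two-character tag, two spaces, hyphen, space, value.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def read_ris(text):
    """Split RIS text into records; each record maps a tag to a list of values."""
    records, current, last_tag = [], {}, None
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            tag, value = match.group(1), match.group(2).strip()
            if tag == "ER":
                # "ER" marks the end of a record
                records.append(current)
                current, last_tag = {}, None
            else:
                # tags such as AU can occur more than once, so collect values in a list
                current.setdefault(tag, []).append(value)
                last_tag = tag
        elif line.strip() and last_tag:
            # a line without a tag continues the previous value
            current[last_tag][-1] += " " + line.strip()
    return records


# Hypothetical sample input (field values are for illustration only)
sample = "\n".join([
    "TY  - JOUR",
    "AU  - Cool, J.",
    "AU  - Rosenblatt, R.",
    "PY  - 2018",
    "SP  - S-1179",
    "ER  - ",
    "",
    "TY  - JOUR",
    "AU  - Podolsky, D. K.",
    "PY  - 1991",
    "SP  - 1575",
    "ER  - ",
])

records = read_ris(sample)
for record in records:
    print(record["AU"], record["PY"][0], record["SP"][0])
```

A comparable XML reader would need a parser and knowledge of EndNote's XML schema, which is the extra complexity mentioned above.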

2. I don't use EndNote, but ...

Sorry, the program has been developed and tested with EndNote export files in RIS format.
However, the code is available on GitHub. For other databases the main changes will probably have to be made only in the functions for reading and writing.

3. I use groups, comments, ... in EndNote and don't want to lose them

EndNote export files in RIS format do not contain grouping information!

If you already had groupings, ... in your EndNote database before you started using DedupEndNote: use DedupEndNote in MarkMode, and then deduplicate within EndNote on the Label field only.

TODO: EXPAND

TODO: XML (formatting)

However, for new projects the following will work: using a MASTER Database and a WORKING Database
(The names of the EndNote databases and the RIS files are just examples)

  1. create a MASTER database
  2. import your first set of records from one or more bibliographic databases into the MASTER Database
  3. (optional: clean up the MASTER Database) if the MASTER Database may contain duplicate records: export the MASTER Database as a RIS file, deduplicate the file with DedupEndNote, import the deduplicated results into a new EndNote database NEW_MASTER, remove the old MASTER Database, and make a copy of the NEW_MASTER Database as your MASTER Database (File > Save a Copy ...)
  4. Make a copy of the MASTER Database as your WORKING Database EndNote file (File > Save a Copy ...)
  5. edit, comment, delete, group, ... records only in the WORKING Database!

When you have new export files from the same or other bibliographic databases:

  1. import these results into a new TEMP EndNote database, export that TEMP EndNote database into a RIS file NEW_RECORDS.txt (overwriting the file if it already exists), and remove the TEMP EndNote database
  2. export the MASTER Database into a RIS file OLD_RECORDS.txt (overwriting the file if it already exists)
  3. use the DedupEndNote option "deduplicate 2 files" with these two files (results will be in NEW_RECORDS_deduplicated.txt)
  4. import NEW_RECORDS_deduplicated.txt into the MASTER Database
  5. import NEW_RECORDS_deduplicated.txt into the WORKING Database.
    And again: edit, comment, delete, group, ... records only in the WORKING Database!
    • because the deleted records from the WORKING Database are still present in the MASTER Database (and as a consequence also in OLD_RECORDS.txt), duplicates for these records in NEW_RECORDS.txt will be recognized by DedupEndNote and not reappear in NEW_RECORDS_deduplicated.txt
    • the existing groupings, edits, ... in the WORKING Database will not be overwritten.

    TODO: What about manual adds?

    TODO: What about updates of ahead-of-print publications?
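The update procedure above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not of DedupEndNote's comparison logic: the helper names `simple_key` and `new_only` and the sample records are hypothetical, and the duplicate test (normalized title plus year) is deliberately crude. The sketch also assumes that duplicates within the new file itself are collapsed.

```python
def simple_key(record):
    """Hypothetical, deliberately crude duplicate key: normalized title plus year."""
    title = "".join(ch for ch in record["title"].lower() if ch.isalnum())
    return (title, record["year"])


def new_only(old_records, new_records):
    """Keep the new records that match nothing in old_records (or earlier new records)."""
    seen = {simple_key(r) for r in old_records}
    kept = []
    for record in new_records:
        key = simple_key(record)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


# MASTER holds everything ever imported; WORKING is the copy you edit.
master = [
    {"title": "Record A", "year": 2018},
    {"title": "Record B", "year": 2019},
]
working = [master[0]]  # "Record B" was deleted in the WORKING Database only

# A new export contains "Record B" again, plus a genuinely new record.
new_export = [
    {"title": "record b.", "year": 2019},
    {"title": "Record C", "year": 2020},
]

deduplicated = new_only(master, new_export)  # plays the role of NEW_RECORDS_deduplicated.txt
master = master + deduplicated               # step 4
working = working + deduplicated             # step 5

print([r["title"] for r in working])
```

Because "Record B" is still present in the MASTER Database, it is filtered out of the new export and does not reappear in the WORKING Database.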

4. I have deleted records in EndNote and don't want them to reappear when updating

See previous question

5. I prefer PubMed records above EMBASE records

DedupEndNote compares records in the order of the input file, after they are grouped on publication year. This makes it possible to influence the choice of duplicate within a duplicate set.

Suppose all records have the field Database Provider filled in with either "PubMed" or "EMBASE". If you order the original EndNote database on the field Database Provider in descending order before exporting to a RIS file, DedupEndNote will use a PubMed record if a duplicate set contains one or more PubMed records and one or more EMBASE records (except if one or more EMBASE records of this duplicate set have a publication year one year later than all PubMed records). TODO: Explain why "later"?

Your preference will probably not be the same as the alphabetical order of the names of database providers ("CINAHL", "Cochrane ...", "EMBASE", ...). You could however change this field to "1 - PubMed", "2 - EMBASE", "3 - Cochrane", ... and order your EndNote database on this field (EndNote will order numerically) before exporting to a RIS file, or use another field with similar content.
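The effect of such a numbered field can be sketched as follows. The sketch assumes that the first record of a duplicate set in input order is the one that is kept, and it ignores the grouping on publication year; the ranking table, the `label` helper and the sample records (with the title standing in for real duplicate detection) are hypothetical.

```python
# Hypothetical preference ranking; unknown providers sort last.
PREFERENCE = {"PubMed": 1, "EMBASE": 2, "Cochrane": 3}


def label(provider):
    """Build a sortable field value such as "1 - PubMed"."""
    rank = PREFERENCE.get(provider, 9)
    return f"{rank} - {provider}"


records = [
    {"id": 1, "provider": "EMBASE", "title": "Study X"},
    {"id": 2, "provider": "Cochrane", "title": "Study X"},
    {"id": 3, "provider": "PubMed", "title": "Study X"},
    {"id": 4, "provider": "EMBASE", "title": "Study Y"},
]

# Order the database on the numbered field before "exporting".
ordered = sorted(records, key=lambda r: label(r["provider"]))

# Stand-in for deduplication: the first record seen in each duplicate set wins.
kept = {}
for record in ordered:
    kept.setdefault(record["title"], record)

for title, record in kept.items():
    print(title, "->", label(record["provider"]))
```

With this ordering the PubMed record is kept for "Study X", and the EMBASE record for "Study Y", which has no PubMed counterpart.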

6. Is DedupEndNote perfect?

Of course not.

However, DedupEndNote tries at all costs to avoid wrongly identifying records as duplicates. The MarkMode was specifically developed for this: it makes it feasible to compare the results of deduplication by EndNote and by DedupEndNote.

False positives

There was only 1 false positive found in the validation sets (11,474 validated records):

  • Cytology screening (from SRA-DM, 1856 records, all validated)
  • Haematology (from SRA-DM, 1415 records, all validated)
  • Respiratory (from SRA-DM, 1988 records, all validated)
  • Stroke (from SRA-DM, 1292 records, all validated)
  • BIG SET (our own test database of 52,828 records, with 4923 records validated)

The false positive consisted of these two records:

  • Cool, J.; Rosenblatt, R.; Kumar, S.; Lucero, C.; Fortune, B.; Crawford, C. V.; Jesudian, A.: The Association Between Portal Vein Thrombosis and Other Venous Thromboembolism in Cirrhosis: Analysis of a Nationally Representative Inpatient Cohort, in: Gastroenterology 154 (6 Supplement 1), 2018, S-1179
  • Cool, J.; Rosenblatt, R.; Kumar, S.; Lucero, C.; Fortune, B.; Crawford, C. V.; Jesudian, A.: TRENDS IN THE PREVALENCE OF PORTAL VEIN THROMBOSIS AND ASSOCIATED MORTALITY IN CIRRHOSIS: ANALYSIS OF A NATIONALLY REPRESENTATIVE INPATIENT COHORT, in: Gastroenterology 154 (6 Supplement 1), 2018, S-1178

False negatives

DedupEndNote will miss some duplicate records, e.g.:

  1. Records have different starting pages and do not share an identical DOI
  2. The publication years of duplicate records can be more than 1 year apart. The validated subset of BIG_SET (4923 records) has 60 examples of duplicate records with different publication years: differences are 1 year (48), 2 years (8) and 4 years (4).
  3. Journal titles are not identified as similar (e.g. "Jbr-btr" and "Journal Belge de Radiology" with different or no ISSNs, or "Am J Dig Dis" and "Aaierj.Dio.Dis." (from EMBASE, definitely a typo), or "J Can Assoc Radiol" and "Canadian Association of Radiologists Journal")
  4. Replies: Identifying replies by "response" as one of the title words is not accurate enough (e.g. "Regulation of hepatic blood flow: The hepatic arterial buffer response revisited") and could wrongly identify duplicates. As a consequence "Incidence and Natural Course of Portal Vein Thrombosis in Cirrhosis Response" (Web of Science) is not marked as a duplicate of "Response to Senzolo and García-Pagán" (PubMed, EMBASE and Scopus) (https://doi.org/10.1038/ajg.2013.298).
  5. Retracted publications: The titles of these publications in Web of Science look like "RETRACTED: Isolated central retinal artery occlusion as an initial presentation of paroxysmal nocturnal hemoglobinuria and successful long-term prevention of systemic thrombosis with eculizumab (Retracted article. See vol. 58, pg. 307, 2014)". Other databases use "Isolated central retinal artery occlusion as an initial presentation of paroxysmal nocturnal hemoglobinuria and successful long-term prevention of systemic thrombosis with eculizumab" and use another field to mark the retraction.
    The Jaro-Winkler Similarity of these 2 titles is 0.77 (in straight as well as reversed order) which is too low to be considered a duplicate.
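For readers unfamiliar with the measure, here is a minimal sketch of the textbook Jaro-Winkler similarity in Python, applied to the two titles from the retraction example. It is not DedupEndNote's implementation: the function names are chosen for this example, and the result depends on how titles are normalized before comparison (here simply lowercased, which is an assumption), so it need not reproduce the 0.77 quoted above.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # characters match only if they are equal and not too far apart
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count matched characters that appear in a different order
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity with the Winkler bonus for a common prefix (max 4 characters)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


plain = ("Isolated central retinal artery occlusion as an initial presentation of "
         "paroxysmal nocturnal hemoglobinuria and successful long-term prevention "
         "of systemic thrombosis with eculizumab")
retracted = "RETRACTED: " + plain + " (Retracted article. See vol. 58, pg. 307, 2014)"

a, b = plain.lower(), retracted.lower()
print(round(jaro_winkler(a, b), 2))              # straight order
print(round(jaro_winkler(a[::-1], b[::-1]), 2))  # reversed order
```

The added prefix "RETRACTED: " and the trailing note shift the characters of the shared title out of alignment, which is why the similarity drops even though one title contains the other.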

After deduplication by DedupEndNote, deduplication with EndNote itself (default setting: Author, Year, Title) could be worthwhile. The case of different starting pages can only be solved by looking up the publication itself; the case of journals that were not recognized as similar can only be solved by your brain.

After deduplicating our test database of 52,828 records with DedupEndNote and importing the result file into a new EndNote database, deduplication with EndNote itself (default setting: Author, Year, Title) found 260 duplicates. After manual inspection, we found 118 records which were true duplicates. Some of the "reasons":

  • different starting page: 25
  • different starting page and DOI: 1
  • different journals: 7, i.e. DedupEndNote couldn't be expected to recognize these journals as similar
  • error in starting page: 1
  • Database error in Journal name: 12
    Scopus sometimes uses the continuation title too early:
    • "Blood Vessels" (PubMed) vs "Journal of Vascular Research" (Scopus, continuation of "Blood Vessels")
    • "Acta Pathol Jpn" (PubMed, EMBASE) vs "Pathology International" (Scopus, continuation of "Acta Pathol Jpn")
    • "Acta Paediatr Jpn" (PubMed) and "Pediatrics International" (Scopus, continuation of "Acta Paediatr Jpn")
    • "American Journal of Pediatric Hematology/Oncology" (PubMed) and "Journal of pediatric hematology/oncology" (Cochrane, Scopus, continuation of "American Journal of Pediatric Hematology/Oncology")
    • "Z Kinderchir" (PubMed) and "European Journal of Pediatric Surgery" (Scopus, continuation of "Z Kinderchir")
    • "Cardiovasc Surg" (PubMed) and "Vascular" (Scopus, continuation of "Cardiovasc Surg")
    • "Int J Addict" (PubMed) and "Substance Use and Misuse" (Scopus, continuation of "Int J Addict")
    • "Aust Paediatr J" (PubMed) and "Journal of Paediatrics and Child Health" (Scopus, continuation of "Aust Paediatr J")
  • DedupEndNote error in Journal name: 3, i.e. DedupEndNote would be expected to recognize these journals as similar
    • ("Jbr-btr" vs "Journal Belge de Radiologie")
    • "Langenbecks Arch Chir Suppl Kongressbd" vs "Langenbecks archiv fur chirurgie. Supplement. Kongressband. Deutsche gesellschaft fur chirurgie. Kongress"
    • "J Vasc Surg Venous Lymphat Disord" vs "Journal of Vascular Surgery: Venous and Lymphatic Disorders"
  • DedupEndNote parsing error in Starting page: 1 ("S6-97-s6-99" vs "S697-s699", DedupEndNote used "6" in the first case instead of "697")

Some cases of duplicate records may be "unsolvable" for programs. Take e.g. the following publication (https://www.nejm.org/doi/full/10.1056/NEJM199105303242207):

  1. Cabot RC, Scully RE, Mark EJ, McNeely WF, McNeely BU, Podolsky DK, Lewandrowski KB. Case 22-1991: A 15-Year-Old Boy with Fever of Unknown Origin, Severe Anemia, and Portal-Vein Thrombosis. New England Journal of Medicine. 1991;324(22):1575-84.
  2. Podolsky DK, Lewandrowski KB. Case records of the Massachusetts General Hospital. New England Journal of Medicine. 1991;324(22):1575-84.
  3. Podolsky DK, Ferrucci JT, Ellis DS, Mark EJ, Pasternack MS, Huang PL, Lewandrowski KB, Dowling WJ, Goldfinger SE. A 15-YEAR-OLD BOY WITH FEVER OF UNKNOWN ORIGIN, SEVERE ANEMIA, AND PORTAL-VEIN THROMBOSIS - APPENDICITIS WITH PERIAPPENDICITIS AND FOREIGN-BODY GIANT-CELL REACTION CONSISTENT WITH PRIOR RUPTURE - PYLEPHLEBITIS WITH THROMBOSIS, ACUTE AND CHRONIC TRIADITIS, AND CHOLANGITIS. New England Journal of Medicine. 1991;324(22):1575-84.
  4. Case records of the Massachusetts General Hospital. Weekly clinicopathological exercises. Case 22-1991. A 15-year-old boy with fever of unknown origin, severe anemia, and portal-vein thrombosis. N Engl J Med. 1991;324(22):1575-84.

"Cabot RC, Scully RE, Mark EJ, McNeely WF, McNeely BU" in (1) were the (associate) editors of the NEJM in 1991, "Podolsky DK, Lewandrowski KB" in (2) described the case, all authors in (3) discussed the case.