DedupEndNote: Justification

1. General rule

The general rule is:

Comparison  Result                                                        Action
1 ... 5     YES                                                           go to next comparison if present, else mark the records as duplicates
1 ... 5     (insufficient data for the comparison in one of the records)  go to next comparison if present, else mark the records as duplicates
1 ... 5     NO                                                            stop comparisons for this pair of records

The part "(insufficient data for the comparison in one of the records) → go to next comparison ..." sets DedupEndNote apart from other deduplication programs.

Justification:

  • If bibliographic databases never missed data (counterexample: the publication with DOI https://doi.org/10.1007/5584_2016_118 has Pages information in PubMed, but not in EMBASE or Scopus)
  • If EndNote import filters (provided by EndNote, provided by producers of bibliographic databases, or home-made) always put comparable data in the same EndNote fields
  • If all bibliographic data of imported records were already available at import time (which is not the case for, e.g., PubMed ahead of print publications)

then the absence of Starting Page / DOI / Authors / ... in an EndNote record would mean that the corresponding publication has no Starting Page / DOI / Authors / .... Alas, the world isn't perfect yet.

DedupEndNote therefore interprets a comparison with insufficient data from one of two records not as "NO" (i.e. these records are different), but as "UNKNOWN" (i.e. we can't tell (yet)), and continues with the other comparisons for these records.

As a consequence, fields which are useful in comparisons but are not always available (e.g. DOI), can be used.

NOTE: If both records have insufficient data for a comparison, DedupEndNote interprets the fields as different. If there is no alternative for the comparison (see below), then DedupEndNote stops the comparisons for this pair of records and considers them different publications. See 2.5. Effect of 'insufficient data from one of two records'.
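The general rule can be sketched as a three-valued comparison. This is a minimal illustration, not DedupEndNote's actual code: the names `Result`, `compare_field` and `are_duplicates` are hypothetical, and the sketch ignores the case where a comparison has an alternative field (such as Starting Page OR DOI in comparison 2).

```python
from enum import Enum
from typing import Callable, Optional, Sequence


class Result(Enum):
    YES = "yes"
    NO = "no"
    UNKNOWN = "unknown"  # insufficient data in exactly one of the two records


def compare_field(a: Optional[str], b: Optional[str],
                  same: Callable[[str, str], bool]) -> Result:
    """Compare one field of two records (hypothetical helper)."""
    if not a and not b:
        return Result.NO       # both insufficient: treated as different
    if not a or not b:
        return Result.UNKNOWN  # one insufficient: we can't tell (yet)
    return Result.YES if same(a, b) else Result.NO


def are_duplicates(results: Sequence[Result]) -> bool:
    """Apply the general rule to the results of comparisons 1..5, in order."""
    for result in results:
        if result is Result.NO:
            return False       # stop comparisons for this pair of records
    return True                # every comparison was YES or UNKNOWN
```

With this rule a pair with results YES, UNKNOWN, YES is marked as duplicates, while a single NO ends the comparisons for that pair.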

2. Comparisons

2.1. The fields chosen

The EndNote fields used in the comparisons are:

EndNote field  Content                                                             Treated as     Used in comparison no.
PY             Publication year                                                    Year           1
SP             Pages                                                               Starting page  2
C7             Article number                                                      Starting page  2
DO             DOI                                                                 DOI            2
AU             Authors                                                             Authors        3
TI             Title                                                               Title          4
OP *           Original title (when reference type is not Conference Proceedings)  Title          4
ST             Short title                                                         Title          4
T3 **          Conference title, Alternate journal title, Original title           Title          4
SN             ISSN or ISBN                                                        ISSN           5
T2             Journal title, Book title                                           Journal        5
J2             Alternate journal                                                   Journal        5
OP *           Conference title (when reference type is Conference Proceedings)    Journal        5
T3 **          Conference title, Alternate journal title, Original title           Journal        5

*: the field OP can be a title variant or a journal variant.

**: Conference titles in T3 are omitted.

Justification:

  • These fields (or combinations of fields) are present for most records. DOIs may be absent, but in comparison 2 (Starting Page OR DOI) most records provide data
  • These fields proved to be sufficient to get good deduplication results.
  • Author: Reducing first names to initials identifies more duplicates although cases where first and last names are mixed up ("Ashish Anil, Sule" vs "Sule, A. A.") are no longer identified as duplicates.
  • Journal: There is no standard (used) for journals. PubMed puts the journal abbreviation in the Journal field, most other databases use the full journal title. There is no agreement on the journal names / abbreviations: "Am J Roentgenol" vs "AJR Am J Roentgenol".
    DedupEndNote treats the content of the journal fields as a set. Comparing the journals of two records comes down to "Is there a journal from the set of journals of record 1 sufficiently similar to a journal from the set of journals of record 2?"
  • Title: There is no standard (used) for non-English titles. Bibliographic databases put the English translation of the title in the Title field, and sometimes the original title in another field.
    The translation may be their own: the original title "Autogreffe de cellules souches hématopoïétiques périphériques dans le cadre du traitement d'hémopathies malignes. Partie I: patients." became:
    • [Autologous transplantation of peripheral blood hematopoietic stem cells in the treatment of hematological malignancies. I: patients] (PubMed)
    • Autologous transplantation of peripheral blood stem cells for haematological malignancies. Part I: patients (Scopus, Web of Science)

    In the test database of 52,000 records, 11% of the records were of non-English origin.

  • Starting Page: Bibliographic databases do not always use the same fields for comparable content. For example, PubMed doesn't use the Article number field and puts the article number in the Pages field, whereas Web of Science uses the Article Number field and puts the number of pages of such a publication in the Pages field.
    While reading the EndNote file, DedupEndNote first takes the Article number (field C7), if present, as Starting page, and overwrites it with the Starting page from the Pages field (SP) if that field contains a "-". In the absence of an Article number (C7), the Pages field (SP) supplies the Starting page.
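The choice between Article number (C7) and Pages (SP) described above can be sketched as follows. The function name `pick_starting_page` is hypothetical, and taking the part before the first "-" as the starting page is an assumption of this sketch, not a documented detail of DedupEndNote.

```python
from typing import Optional


def pick_starting_page(c7: Optional[str], sp: Optional[str]) -> Optional[str]:
    """Choose the value treated as Starting page from fields C7 and SP."""
    # Start with the Article number (C7), if there is one.
    starting = c7.strip() if c7 and c7.strip() else None
    if sp and sp.strip():
        sp = sp.strip()
        # SP wins when it looks like a page range, or when C7 is absent.
        if starting is None or "-" in sp:
            # Assumption: the starting page is the part before the first "-".
            starting = sp.split("-", 1)[0].strip() or starting
    return starting
```

For a Web of Science style record (article number in C7, number of pages in SP) this keeps the article number; for a PubMed style record (article number in SP, no C7) it takes the SP value.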
2.2. The fields not chosen

Justification:

  • Reference type: Bibliographic databases don't agree on this. Publications in "Advances in Experimental Medicine and Biology" are Journal Article when imported from PubMed or EMBASE, but Serial when imported from Web of Science.
    The only exception to this rule is Conference proceedings: The OP field can be the title of a conference or the original (non-English) title
  • Volume and Issue: these fields were found to have no additional value for deduplication
  • Accession number: Accession numbers are tied to specific bibliographic databases, so they are of minimal value for deduplication
  • Type of Work: Not always provided by bibliographic databases, not standardized over bibliographic databases
  • Ending page: Not always provided by bibliographic databases. Comparison by only Starting page proved to be sufficient to get good deduplication results.

2.3. Similarity

DedupEndNote uses Jaro-Winkler Similarity instead of equality when comparing some fields (Authors, Titles, and partly Journals).

Justification:

  • (see Examples of comparisons on the home page)
  • Most bibliographic databases use UTF-8, but some (e.g. Web of Science) still use the ASCII set, which makes comparison by equality hopeless when diacritics or Greek letters are used in one record but not in the other
  • Bibliographic databases may add metadata (e.g. for Title: "[French]", "Authors reply", ...)
  • Bibliographic databases may use different rules for number of authors recorded (when to use "et al."). Are:
    • Albrecht, M. H.; Vogl, T. J.; Wichmann, J. L.; Martin, S. S.; Scholtz, J. E.; Fischer, S.; Hammerstingl, R. M.; Harth, M.; Nour-Eldin, N. A.; Thalhammer, A.; et al.,
    • Albrecht, M. H.; Vogl, T. J.; Wichmann, J. L.; Martin, S. S.; Scholtz, J. E.; Fischer, S.; Hammerstingl, R. M.; Harth, M.; Nour-Eldin, N. A.; Thalhammer, A.; Zangos, S.; Bauer, R. W.
    the same authors?
  • Is "NFκB" (publication form) "NFκB", "NFkappaB" or "NF kappa B"?
  • Bibliographic databases don't agree on the author name:
    • Is "Xingshun Qi" (publication form) "Qi, X." (PubMed) or "Qi, X. S." (Web of Science)?
    • Is "E. MorenoGonzález" (publication form) "MorenoGonzález, E." (Scopus), "Moreno González, E." (PubMed) or "Gonzalez, E. M." (Web of Science)?

Values used for Jaro-Winkler Similarity:

Field       Case                                Threshold  Explanation
Authors     default (i.e. not a Reply)          0.67
Authors     Reply and sufficient Start Pages    0.75       When a Reply the titles are not compared
Authors     Reply and insufficient Start Pages  0.8        When a Reply the titles are not compared
Journals *  default (i.e. not a Reply)          0.9
Journals *  Reply                               0.93
Title       Sufficient Start Pages or DOIs      0.9
Title       Insufficient Start Pages and DOIs   0.94

*: Journals: if the JWS between journal titles is below the threshold, two comparisons of the following types are tried:
  • "Ann Fr Anesth Reanim" and "Ann... Fr... Anesth... Reanim..."
  • "BMJ" and "B... M... J..."
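The two fallback comparisons for journals can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, words are compared position by position without skipping words such as "of" or "de", and accents are not stripped. DedupEndNote's actual handling may differ on all of these points.

```python
import re
from typing import List


def _words(title: str) -> List[str]:
    """Lower-cased words of a journal title, punctuation dropped."""
    return re.findall(r"\w+", title.lower())


def matches_word_prefixes(abbrev: str, full: str) -> bool:
    """'Ann Fr Anesth Reanim' type: each abbreviated word starts the matching full word."""
    a, f = _words(abbrev), _words(full)
    return bool(a) and len(a) == len(f) and all(
        fw.startswith(aw) for aw, fw in zip(a, f))


def matches_initials(abbrev: str, full: str) -> bool:
    """'BMJ' type: each letter of the abbreviation is the initial of the matching full word."""
    letters = "".join(_words(abbrev))
    f = _words(full)
    return bool(f) and len(letters) == len(f) and all(
        fw.startswith(ch) for ch, fw in zip(letters, f))


def journal_fallback_match(abbrev: str, full: str) -> bool:
    """Try both fallback comparisons."""
    return matches_word_prefixes(abbrev, full) or matches_initials(abbrev, full)
```

For example, `journal_fallback_match("BMJ", "British Medical Journal")` is true through the initials comparison. A pair such as "Am J Roentgenol" and "American Journal of Roentgenology" is not matched by this simplified version because of the extra word "of".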

Why Jaro-Winkler Similarity (JWS) and not Levenshtein distance / ...?

  • JWS is always a value between 0 and 1, which makes it easier to choose a cut-off point for the required similarity (e.g. similarity > 0.91)
  • JWS's property of putting a heavier penalty on differences at the beginning of strings proved very helpful in comparing Titles and Authors:
    • "..." and "... - Commentary" are more similar than under a similarity measure without this bias towards the start of the string
    • "..." and "Case report: ..." (when compared in their reversed form) are more similar than under a similarity measure without this bias towards the start of the string
    • some bibliographic databases list all authors, others limit them and add "et al."
  • JWS proved useful, so no other similarity measures were tested
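To make the role of the string start concrete, here is a minimal sketch of the textbook Jaro-Winkler Similarity. It is not DedupEndNote's own implementation; the prefix limit of 4 characters and the scaling factor 0.1 are the conventional defaults and are assumed here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: a value between 0 and 1."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity with a bonus for a common prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)
```

Because the bonus depends only on the common prefix, a title and the same title with " - Commentary" appended score higher with `jaro_winkler` than with plain `jaro`. Reversing both strings before the comparison moves a leading "Case report: " to the end, so the shared title text becomes the common prefix.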
2.4. The order of comparisons

Records are put in year sets based on the publication year. Records without publication year are put in a special year set YEAR_0.

When deduplicating 1 file, records are compared in descending order of pairs of year sets. The records in YEAR_0 are added to each of these pairs except for the ones which are already marked as duplicates. For an EndNote RIS file with records from 1889 to 2020:

  • YEAR 2020 + YEAR 2019 + YEAR 0: all records of YEAR 2020 are compared to all records of YEAR 2020 and YEAR 2019 and YEAR 0
  • YEAR 2019 + YEAR 2018 + YEAR 0: all records of YEAR 2019 are compared to all records of YEAR 2019 and YEAR 2018 and YEAR 0
  • ...
  • YEAR 1889 + YEAR 0: all records of YEAR 1889 are compared to all records of YEAR 1889 and YEAR 0
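The grouping listed above can be generated with a few lines of code. This is a sketch only: the function name and the use of `0` as the key for YEAR_0 are assumptions made for illustration.

```python
from typing import List

YEAR_0 = 0  # assumed key for the set of records without a publication year


def descending_year_groups(first_year: int, last_year: int) -> List[List[int]]:
    """Year sets compared together, from the most recent year down to the oldest."""
    groups = []
    for year in range(last_year, first_year - 1, -1):
        group = [year]
        if year > first_year:
            group.append(year - 1)  # the neighbouring, earlier year set
        group.append(YEAR_0)        # records without a year join every group
        groups.append(group)
    return groups
```

For records from 1889 to 2020, the first group is `[2020, 2019, 0]` and the last one is `[1889, 0]`.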

Justification:

  • Pairs of year sets:
    • Bibliographic databases sometimes don't agree on the publication year; in most cases the difference is only 1 year
    • The publication year of Ahead of print publications and of their final form can be more than 1 year apart, but in the majority of cases they are at most 1 year apart.
    • Extending to groupings of more years would take longer and could cause more false positives
  • Descending order: Because the first record of a set of duplicate records is saved to the output file,
    • when encountering an ahead of print publication of e.g. 2018 and a corresponding final record of 2019, the 2019 record will be saved
    • when encountering a record without a publication year and a corresponding record with a publication year, the record with the publication year will be saved

When deduplicating 2 files, records of both files are compared in ascending order of pairs of year sets. YEAR_0 is added to each of these pairs, but records which are already marked as duplicates are removed first.
The records of the OLD file are read before the records of the NEW file; because the duplicate chosen in a set of duplicate records is the first one encountered, duplicate records from the OLD file will be chosen when present.
For EndNote RIS files with records from 1889 to 2020:

  • YEAR 0 + YEAR 1889 + YEAR 1890: all records of YEAR 0 + YEAR 1889 are compared to all records of YEAR 0 and YEAR 1889 and YEAR 1890
  • YEAR 0 + YEAR 1890 + YEAR 1891: all records of YEAR 0 + YEAR 1890 are compared to all records of YEAR 0 and YEAR 1890 and YEAR 1891
  • ...
  • YEAR 0 + YEAR 2020: all records of YEAR 0 + YEAR 2020 are compared to all records of YEAR 0 and YEAR 2020

The output file only contains records from the NEW file which are not duplicates of records of the OLD file, and (if there are duplicates within the NEW file) are the first duplicate encountered within that duplicate set.
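The selection of output records in the two-file case can be sketched as below. The sketch leaves out the year sets and takes the duplicate test as a parameter; `new_records_to_keep` and `is_duplicate` are hypothetical names, not part of DedupEndNote.

```python
from typing import Callable, List, TypeVar

R = TypeVar("R")


def new_records_to_keep(old: List[R], new: List[R],
                        is_duplicate: Callable[[R, R], bool]) -> List[R]:
    """NEW records that duplicate neither an OLD record nor an earlier NEW record."""
    seen: List[R] = list(old)    # OLD records are read first
    kept: List[R] = []
    for record in new:
        if not any(is_duplicate(record, earlier) for earlier in seen):
            kept.append(record)  # first of its duplicate set, and not in OLD
        seen.append(record)
    return kept
```

With integers standing in for records and equality as the duplicate test, `new_records_to_keep([1, 2], [2, 3, 3, 4], lambda a, b: a == b)` returns `[3, 4]`.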

Justification:

  • Ascending order:
    • when encountering an ahead of print publication of e.g. 2018 in the OLD file and a corresponding final record of 2018 or 2019 in the NEW file, the NEW record will be seen as a duplicate of the OLD record and not be saved. Saving that record would create a duplicate
    • when encountering a record without a publication year in the OLD file and a corresponding record with a publication year in the NEW file, the record with the publication year will be seen as a duplicate of the OLD record and not be saved. Saving that record would create a duplicate

2.5. Effect of 'insufficient data from one of two records'

The 5th comparison (ISSN or Journal: Are they the same (ISSN) or similar (Journal)?) seems at first sight to apply only to journal articles (with the side effect that publications of another type could never be duplicates). This is not completely true:

  • ISBN is treated the same way as an ISSN
  • The EndNote fields T2, J2, OP and T3 can also be used with other publication types

However: the general rule treats 2 records / field sets as different if both records have insufficient data for that comparison. Two book records with the same authors, publication year and book title will be considered duplicates only if both have the same / a similar ISBN.

Relaxing this general rule (so that comparisons with insufficient data in one or both records are treated the same way: UNKNOWN, so go on to the next comparison) would result in a lot more False Positives.

3. Enriching the duplicate chosen

Justification:

  • (all cases): the enriched data are copied from existing duplicate records or occur in a lot of records (empty author, full pages form, omitted same ending page)
  • Author "Anonymous": There is no reason to use 2 forms (Author "Anonymous" and no Author) for this case
  • DOI - adding from other duplicates: Considered a useful addition
  • DOI - standardized form: Considered a useful addition (clickable in EndNote)
  • Publication year - missing added from other duplicates: Records without publication year are subpar
  • Starting page and Article number:
    • Some bibliographic databases (e.g. PubMed) treat them the same way, others (e.g. Scopus, Web of Science) don't
    • EndNote output formats (e.g. Vancouver style) can't handle Article numbers
  • Starting page - missing added from other duplicates: Records without starting page are subpar
  • Starting page - standardized to full form:
    • EndNote deduplication treats "492-5" and "492-495" as different values
    • EndNote output formats (e.g. Vancouver style) handle this full form gracefully (emit "492-5" for pages "492-495")
  • Starting page - omitting same end page:
    • Some bibliographic databases (esp. Web of Science) sometimes use this form (e.g. "211-211")
    • EndNote deduplication treats "211" and "211-211" as different values
    • EndNote output formats (e.g. Vancouver style) emit "211-" for pages "211-211"
  • Title - Reply: Inconsistent use by bibliographic databases. The longest title holds most information
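The two page standardizations mentioned above (full form, and omitting an ending page equal to the starting page) can be sketched as follows. The function name is hypothetical and only purely numeric page ranges are handled; other values are returned unchanged, which is an assumption of this sketch.

```python
import re


def normalize_pages(pages: str) -> str:
    """Expand '492-5' to '492-495' and reduce '211-211' to '211'."""
    match = re.fullmatch(r"\s*(\d+)\s*-\s*(\d+)\s*", pages)
    if not match:
        return pages.strip()  # not a plain numeric range: leave as is
    start, end = match.group(1), match.group(2)
    if len(end) < len(start):
        # The abbreviated ending page inherits the leading digits of the start.
        end = start[: len(start) - len(end)] + end
    if end == start:
        return start          # omit an ending page equal to the starting page
    return f"{start}-{end}"
```

After this normalization "492-5" and "492-495" compare as equal, as do "211" and "211-211".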