Standardize publisher names #3
Use a slug reduction to collapse casing and punctuation variants of the same publisher. Add further publisher name patches as per greenelab/scihub-manuscript#24
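For readers unfamiliar with the idea, here is a minimal sketch of how a slug can collapse casing and punctuation variants. It is not the code from this pull request: the function names `slugify` and `canonical_names`, the rule of dropping every non-alphanumeric character, and the choice of keeping the first-seen spelling are all assumptions made for illustration.

```python
import re
import unicodedata


def slugify(name: str) -> str:
    """Reduce a publisher name to a lowercase ASCII alphanumeric key (hypothetical rule)."""
    # Decompose accented characters, then drop the non-ASCII combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Lowercase and remove everything that is not a letter or digit,
    # so spacing, hyphens, and periods no longer distinguish names.
    return re.sub(r"[^a-z0-9]+", "", ascii_only.lower())


def canonical_names(names):
    """Map each raw name to the first-seen name sharing its slug (an arbitrary choice)."""
    chosen = {}
    for name in names:
        chosen.setdefault(slugify(name), name)
    return {name: chosen[slugify(name)] for name in names}


if __name__ == "__main__":
    # Made-up names, used only to show the behaviour.
    variants = ["Example Press, Ltd.", "EXAMPLE PRESS LTD", "Other House"]
    for raw, canonical in canonical_names(variants).items():
        print(f"{raw!r} -> {canonical!r}")
```

A slug like this only merges names that differ in case, spacing, or punctuation. Genuinely different spellings (abbreviations, reordered words, typos) still need manual patches or clustering.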
|
For convenience, I've converted
@tamunro would love your eye here. Most of the Sci-Hub analyses will have to be rerun with this new dataset, so now is the time to identify any issues. Would like to get this wrapped up in the next day or two. |
|
I had a quick run over it, clustering with OpenRefine to give "Clustered publishers". I added an "edited" column (Y or N), currently filtered to the ones I changed: Some of these involve subjective decisions, as you can see. Obviously, there will be many other redundancies, but it seems like a very low fraction. I removed "Springer" where it had been inserted. |
|
Thanks @tamunro! This is tremendously helpful. Based on this contribution and your previous feedback, we'll make sure to include you as a coauthor on the next version of the manuscript (if this is something you are interested in). I'll get to work incorporating these patches. |
Ah I understand now. They replaced |
See #3 (comment) Added rows from publishers-clustered.xlsx where Edited==Y to publisher-name-patches.tsv. See https://github.com/dhimmel/scopus/files/1438905/publishers-clustered.xlsx Manual viewing of `git diff --color-words data/publisher-name-patches.tsv` confirms patch veracity.
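The step of keeping only the rows marked as edited could look roughly like the sketch below. The column names (`publisher`, `clustered_publisher`, `edited`) and both function names are hypothetical; the real layout of publishers-clustered.xlsx and publisher-name-patches.tsv is not shown in this thread, and the sketch assumes the spreadsheet has already been exported to tab-separated text.

```python
import csv
import io


def read_edited_patches(tsv_text, old_col="publisher",
                        new_col="clustered_publisher", flag_col="edited"):
    """Return {old_name: new_name} for rows flagged as edited (hypothetical columns)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    patches = {}
    for row in reader:
        is_edited = row[flag_col].strip().upper() == "Y"
        # Skip rows that are flagged but whose name did not actually change.
        if is_edited and row[old_col] != row[new_col]:
            patches[row[old_col]] = row[new_col]
    return patches


def apply_patches(names, patches):
    """Replace each name that has a patch; leave every other name untouched."""
    return [patches.get(name, name) for name in names]


if __name__ == "__main__":
    # Made-up rows, used only to show the behaviour.
    sample = (
        "publisher\tclustered_publisher\tedited\n"
        "Exmaple Press\tExample Press\tY\n"
        "Other House\tOther House\tN\n"
    )
    patches = read_edited_patches(sample)
    print(apply_patches(["Exmaple Press", "Other House"], patches))
```

Keeping the patches as a plain old-name to new-name mapping makes them easy to review in a diff, which matches the manual `git diff --color-words` check described above.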
Update Scopus to dhimmel/scopus@2129f07 Includes publisher name fixes as per: greenelab/scihub-manuscript#24 dhimmel/scopus#3 The number of articles attributable to journals has increased.
|
Thanks very much for the authorship offer! That's very generous. Don't finalize it yet - I left a more exhaustive clustering running overnight, and there are some more I'll fix today. I did try to report the errors to Scopus, but their online help is dire, and I never got a reply. The only time I've heard from them is when I reported missing content on ScienceDirect, and my report got sent to Scopus by mistake. So I pointed out the mistake, and they sent that to Scopus by mistake too. Then I gave up. |
Okay, I did make downstream updates in greenelab/scihub@f8531be, but will rerun things to incorporate additional publisher corrections. Note that most differences in greenelab/scihub@f8531be are from other Scopus improvements (like better ISSN mappings) and not just from the publisher patches. |
|
Here's a greatly expanded version: publishers-clustered 2017-11-3.xlsx It turned out to be a shockingly dirty dataset for publishers with few journals. The cleaning could go on forever. So it's probably best to hedge any conclusions about those ones. |
|
@tamunro added these additional patches in 5017121. Thanks a lot!
Yeah, it's a mess, but I think our patches will have fixed most of the really atrocious errors.
Sure... it'd be great if they fixed these issues upstream. Unfortunately I don't know their GitHub handle, but you should feel free to inform them.
Nothing is wanted in return and this repo is in the public domain. They should use it to improve their database's quality. |
|
Given how dirty the Scopus names are, another possibility that occurred to me would be to take them from Crossref. Alternatively, their title list has 55,000 serials with the publisher names and ISSNs, but not all the DOI prefixes. I presume these are the current publishers, not the registrants. Either way, from searching the lists, they're clearly vastly higher-quality than Scopus's. |
If we were to redo things, I'd probably switch all journal metadata to Crossref and entirely forgo Scopus. In the past, this was not possible due to CrossRef/rest-api-doc#179. I think we want to hold off on re-architecting the Sci-Hub analysis as much as possible, given that the project is nearing completion. The point about registrant versus current publisher is interesting. For the Sci-Hub coverage analyses, I'm not sure which one is more appropriate. Probably current publisher. Since there's been lots of consolidation, if we took the registrant publisher, there'd be lots of publishers that have now been subsumed by a bigger one. How did you find out about ftp://ftp.crossref.org/titlelist/titleFile.csv? I'm curious whether Crossref's FTP site hosts other useful files. |
|
I just found it on their website, googling for something. I know nothing about their ftp, I'm afraid. Maybe one of the developers could point you at some hidden extras. |
DOI to journal mappings as well as the journal catalog changed, resulting in the new statistics reported here. Includes publisher name patches. See dhimmel/scopus#3.
DOI to journal mappings as well as the journal catalog changed, resulting in the new statistics reported here. Includes publisher name patches. See dhimmel/scopus#3. Changes to the source analyses from these updates are available at greenelab/scihub@f8531be