Monday, May 7, 2012


I've moved to the SciAm network (yes!) where I'll be blogging at the Information Culture blog, together with excellent blogger Bonnie Swoger. As you can imagine, I'm thrilled, and I hope you'll be as pleased with the new blog as I am. 


Tuesday, April 24, 2012

The post-journal era

Most scholarly publishing today goes more or less like this: a scientist writes a manuscript about research funded by her university and/or the grant fairy (usually a government agency), then submits it to a commercial peer-review journal. An editor (working either for free or for an "honorarium") reads her manuscript and sends it to appropriate peer reviewers (payment? what payment?). Then, if her manuscript is accepted, her institute's library gets the privilege of buying access to the published version. This state of affairs is very profitable for the commercial publishers' stockholders, but less so for scientists, libraries and the general public, who rarely get to read research they paid for. While many people agree this system is, shall we say, less than optimal, attempts to remedy the situation have been less than successful, and the commercial publishers might be targeting our research budgets next.

The latest attempt to renovate the system comes from Priem & Hemminger (2012). At the beginning of their paper, they suggest that previous attempts at reinventing scholarly publishing have failed for two reasons:

1. Changes to peer review are just patches on a fundamentally broken scholarly journal system.
2. Proposals offer no smooth transition from the present system.

Today, the journal fills four main functions: it archives scholarly material, time-stamps (registers) researchers' contributions, disseminates scholarly products, and certifies contributions (if it's published in a high-impact journal, it must be of value). Priem and Hemminger want to make each of these functions independent of the others.

Their first suggestion is to "refactor" the system. This means locating "parts which are confusing, inefficient or redundant" and improving them without hurting the rest of the system. Their second suggestion is the "decoupled journal" (DcJ) (more about this later).

Overlay journals
These journals were suggested by Ginsparg (1997) and only provide the "stamp of approval" to already published-archived-registered material. Despite the promise the overlay model represents, it hasn't been successful so far, and almost every journal that tried it went back to the traditional coupled model.

The PLoS One model
PLoS One is an open access journal which publishes work not according to what the editors and reviewers consider significant, but only according to the paper's methodological quality. They decoupled the significance approval from the methodological approval. PLoS One also decoupled copy-editing: they warn in advance that they don't copy-edit in detail, and instead provide a list of services which do just that. This model has proven profitable: PLoS One published more than 5,000 papers in 2010 at $1,350 each (and the other PLoS journals charge even more). The flaws here, beyond the price, are the exclusivity (authors publish in only one journal) and the danger of a future with only a few mega-journals.

Post-publication review services
There are a few existing post-publication peer review services; the best-known of them are Faculty of 1000 (F1000) and Mathematical Reviews. F1000 "...identifies and evaluates the most important articles in biology and medical research publication." F1000 is supposed to serve as additional help for researchers in managing their reading. It has actually been shown to identify quality papers which were overlooked by leading journals (Allen et al., 2009).

Mathematical Reviews is an abstracting service, but as Priem & Hemminger say, it is "occasionally called into service as a post-publication peer review venue when the traditional journals fail in their role as certifiers. In this case, abstracters may abandon objectivity and attack papers and their reviewers directly."

These services have one major problem: they aren't brand names, and can't replace the certification of well-established journals, no matter how sound their peer review is.

The Deconstructed Journal
Smith (1999) had three insights about the Deconstructed Journal (DJ):
1. The means (journal) and the functions are not the same.
2. Any system implemented in place of the journal has to be at least as good.
3. Several cooperating agencies could successfully replace the central publisher.

Priem and Hemminger cite Van de Sompel et al. (2004) and Smith (2003) as those who pointed out the advantages of a deconstructed system:

"...encourages innovation, adapts well to changing scholarly practices, and democratizes the largely monopolized scholarly communication market" 
However, Van de Sompel's and Smith's proposals are somewhat outdated, because they didn't take social media into account.

The functions of the decoupled journal
The decoupled journal (DcJ, rather than DJ) is the updated version of the DJ. It is a universal, or meta, journal, where everything scholars produce and share is stored long-term, added to other projects, linked to, commented on, and so forth. With the DcJ, publication is the first step in a process of revisions, reviews, etc. Scholarly items will need persistent IDs, storage, and mirrored backups in order to survive long-term. This can be done with persistent identifiers such as the DOI and with institutional or subject-area repositories (ArXiv, Pubmed).

After the publication of a draft, it's time for preparation, defined by the authors as "Changing the format of a work to make it more suitable for a given (human or electronic) audience". Today, many companies sell services (like copy-editing) to authors, but preparation is still mostly left to the journal. The DcJ will allow authors the freedom to choose the preparation they prefer (say, PDF or HTML format). PLoS One, as mentioned before, already leaves copy-editing to authors, perhaps showing the beginning of a trend.

After the preparation comes the assessment, defined as "Attaching an assessment of quality to a scholarly object". Today's method of assessment, peer review, is usually anonymous, unpublished to the general public, and done by invited reviewers. The reviewers first give their opinion in free text, then a final recommendation on whether the material should be published.

In the Priem & Hemminger model, reviewers don't decide whether the material is publishable (it's already published!) but certify it. In the future, Nature could become the "Nature stamping agency" and give papers its "seal of approval". It could even do so by giving grades, rather than just accepting or rejecting a paper. There will be agencies that review only the soundness of the work (like PLoS One does today), agencies that certify only certain parts, open peer reviews and blind peer reviews. Other forms of assessment - blog posts, number of downloads, and even tweets - will be stored as well. The authors see the DcJ as a way to let peer review evolve freely, without its tight coupling with the other functions of the journal.

With libraries' budgets tighter than ever (even Harvard decided that commercial journals are just too expensive), I expect more and more authors will choose the DcJ route. However, a certification bottleneck could be created, with the prestigious journals of today becoming the prestigious stamping agencies of tomorrow. The number of expert peer reviewers in each field could become a limitation as well. Will our grandchildren complain about the amount of money they have to pay for a Science certification? Only time will tell.

Allen, L., Jones, C., Dolby, K., Lynn, D., & Walport, M. (2009). Looking for landmarks: the role of expert review and bibliometric analysis in evaluating scientific publication outputs. PLoS One, 4(6). PMID: 19536339

Ginsparg, P. (1997). Winners and losers in the global research village. The Serials Librarian, 30(3-4), 83-95. DOI: 10.1300/J123v30n03_13

Priem, J., & Hemminger, B. (2012). Decoupling the scholarly journal. Frontiers in Computational Neuroscience, 6. DOI: 10.3389/fncom.2012.00019

Smith, J. W. T. (1999). The deconstructed journal – a new model for academic publishing. Learned Publishing, 12(2), 79-91. DOI: 10.1087/09531519950145896

Smith, J. W. T. (2003). The deconstructed journal revisited: a review of developments. ICCC/IFIP Conference on Electronic Publishing (ElPub03): From Information to Knowledge, Minho, Portugal.

Van de Sompel, H., Payette, S., Erickson, J., Lagoze, C., & Warner, S. (2004). Rethinking scholarly communication. D-Lib Magazine, 10(9). DOI: 10.1045/september2004-vandesompel

Friday, March 30, 2012

When Prince Charming kissed Mendel: delayed recognition in science

The monk Gregor Mendel didn't live to see his peas become famous; his paper lay asleep, waiting for Prince Charming to cite it awake. Of course, not all "delayed recognition" papers sleep as long as Mendel's, but "sleeping beauty" or "Mendel syndrome" papers do exist in science. A "sleeping beauty" paper can go uncited for years, until suddenly it's awakened.

Costas, van Leeuwen and van Raan (2010) classify published scientific papers according to three general types:

Normal-type: these follow the usual citation distribution, typically reaching their citation peak 3-4 years after publication and then decaying.

Flash-in-the-pan type: these get cited very often when they first come out, but are forgotten in the long run, kind of like a teenage pop star.

Delayed-type: papers that start drawing interest later than the normal-type. Costas et al. prefer not to call them all "sleeping beauties", because real sleeping beauties (never cited at all, then suddenly rising to fame) are very rare.

Source: Costas, van Leeuwen and van Raan (2010)
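As a rough illustration, the three types can be told apart by when the citation curve peaks. The sketch below is my own toy heuristic with made-up thresholds, not a method from Costas et al.:

```python
def classify(citations_per_year, normal_peak=(3, 4)):
    """Toy classifier for a citation curve.

    citations_per_year[i] = citations received i years after
    publication. The peak-year thresholds are illustrative only.
    """
    peak_year = citations_per_year.index(max(citations_per_year))
    if peak_year < normal_peak[0]:
        return "flash in the pan"   # early peak, forgotten later
    if peak_year > normal_peak[1]:
        return "delayed"            # interest arrives late
    return "normal"
```

A curve peaking in its first year comes out as a "flash in the pan", while one peaking a decade after publication is "delayed"; a real classifier would of course also look at the curve's shape, not just its peak.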

Looking at all the documents in Web of Science between the years 1980 and 2008 (over 30 million), Costas et al. found that "flash in the pan" papers tend to be editorials, notes, reviews and so forth, rather than research articles. Delayed documents were more prominent in the "articles" category. When they checked Nature and Science, two 'letter' journals, Costas et al. found that "flash in the pan" documents make up 10.9% and 10.5% of their output respectively, higher than the database average of 9.8%.

The sleeping beauty's castle is the availability of information. The information has to be accessible, and it has to be visible. The Web, of course, has improved the accessibility of papers a great deal, especially when said papers are open access. When a paper is digitized or made open access, its visibility and availability increase. But being available is not enough: researchers must have use for the information despite the passage of time.

The prince kisses the sleeping beauty awake

Source: Wang, Ma, Chen & Rao, 2012

In 1995, Polchinski's paper on supergravity in string theory, "Dirichlet branes and Ramond-Ramond charges", came out and cited an early work by Romans (1986) on the same subject. Romans' paper had not been cited from 1986 to 1995(!), but according to Google Scholar's count (which admittedly could be inflated), it has been cited 424 times since then. Why? One reason is that Romans' paper was simply ahead of its time, published in a "sleeping beauty" field. In the nine years before Polchinski's paper, interest in supergravity had increased considerably. Another reason is that Polchinski is a high-class prince, with great academic authority. An unknown scholar probably wouldn't have been as successful in waking up Romans' paper.

Source: Wang, Ma, Chen & Rao, 2012

An extension of the "Mendel syndrome" is "Mendelism", when researchers "develop lines of research and have a profile of publications ('oeuvres') 'ahead of their time'" (recent Nobel laureate Dan Shechtman comes to mind). In citation terms, the Mendel syndrome is defined by Costas et al. as "the undervaluation through citation analysis of units (individuals, teams, etc.) due to significant patterns of delayed reception of citations in their scientific publications."

Wang et al. return to a question by Eugene Garfield, the father of citation indexing: "Would Mendel’s work have been ignored if the Science Citation Index was available 100 years ago?" We can only wonder.

Costas, R., van Leeuwen, T. N., & van Raan, A. F. J. (2011). The "Mendel syndrome" in science: durability of scientific literature and its effects on bibliometric analysis of individual scientists. Scientometrics, 177-205.

van Raan, A. (2004). Sleeping Beauties in science. Scientometrics, 59(3), 467-472. DOI: 10.1023/B:SCIE.0000018543.82441.f1

Costas, R., van Leeuwen, T. N., & van Raan, A. F. J. (2009). Is scientific literature subject to a sell-by-date? A general methodology to analyze the durability of scientific documents. Journal of the American Society for Information Science and Technology. arXiv: 0907.1455v1

Wang, Ma, Chen, & Rao (2012). Why and how can "sleeping beauties" be awakened? The Electronic Library, 30(1), 5-18.

Friday, December 30, 2011

Correlation between reference managers and the WoS

Even though web citations have been a part of our lives for several years now, the correlation between "traditional" citations and web resources like Mendeley, CiteULike, blog networks, etc. hasn't been thoroughly studied yet, and any new research in the field is very interesting (to me, anyway).

The new paper was published in Scientometrics by Li, Thelwall (still one of my dissertation advisors) and Giustini. They focused on the correlation between user count - the number of users who save a particular paper - and WoS and Google Scholar citations.

The researchers extracted from WoS all the Nature and Science research articles published in 2007, along with their references. They ended up with 793 Nature and 820 Science articles, or 1,613 articles overall (not including references, of course). Then, they searched CiteULike and Mendeley by article title for those articles' user counts. They also collected the same data from Google Scholar. It's important to note that at the time of the study Mendeley had 32.9 million articles indexed, while CiteULike had only 3.5 million.

Google Scholar's mean and median citation counts were higher than WoS's (not surprising; if you want better citation numbers, always use GS). They found that despite Mendeley being "younger" than CiteULike (launched in 2008 and 2004 respectively), CiteULike had only about two-thirds of the sample articles saved, while Mendeley had about 92%.

Spearman correlations between citations in GS and WoS were high in this study (0.957 for Nature and 0.931 for Science). The correlations between Mendeley's user counts and the citations in WoS and GS were also rather good (0.559 and 0.592 for WoS and GS respectively for Nature, 0.540 and 0.603 for Science). CiteULike had far weaker correlations: 0.366 with WoS and 0.396 with GS for Nature, 0.304 with WoS and 0.381 with GS for Science.
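For readers who want to try this kind of analysis on their own data, Spearman's correlation is a one-liner with scipy. The counts below are made-up for illustration, not the study's data:

```python
from scipy.stats import spearmanr

# Hypothetical per-article counts (illustrative only):
wos_citations  = [120, 45, 300, 12, 78, 210, 5, 95]
mendeley_users = [80, 30, 150, 10, 60, 140, 8, 55]

# Spearman's rho compares rankings rather than raw values, so it
# is robust to the highly skewed distributions typical of
# citation and readership counts.
rho, p_value = spearmanr(wos_citations, mendeley_users)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.4f})")
```

Rank-based correlation is the usual choice in bibliometrics precisely because a handful of blockbuster papers would otherwise dominate a Pearson correlation.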


The authors remind us that correlation isn't causation, saying they can't conclude a causal relationship based on correlations between two data sources. Therefore, it can't be determined for sure whether there is a connection between a high user count and a high number of citations. Only Nature and Science were studied, so it may well be that the results don't hold for other journals. Also, group-saved and single-user-saved references were given the same weight. The numbers of saved references in Mendeley and CiteULike are much smaller than the WoS counts, and therefore the results might be less reliable.

The authors speculate that user counts may represent the scientific impact of articles more accurately, and note that online reference managers let one measure the impact of all sorts of resources, unlike the more limited bibliographic indexes.

I think it could be that reference managers don't always reflect readership: one could save a reference and forget about it altogether later (so many articles, so little time...). On the other hand, citation counts might suffer from the same problem, as many scientists use a "rolling citation" from other articles citing an earlier article, without actually having read the article themselves.

Priem et al. also recently presented a study about web citations and WoS citations, based on data from the seven PLoS journals, but I think I'll wait for the journal article to cover it on the blog.

Li, X., Thelwall, M., & Giustini, D. (2011). Validating online reference managers for scholarly impact measurement. Scientometrics. DOI: 10.1007/s11192-011-0580-x

Wednesday, December 14, 2011

Reinventing Discovery, Part II

This is the second part of my review of Michael Nielsen's book "Reinventing Discovery - The New Era of Networked Science" (the first part is here). Last time we talked about Galaxy Zoo, the Polymath Project, and why scientists don't (usually) do wikis. This time I'd like to focus on the parts of the book that talk about ArXiv.

First of all, I have to say I've been using ArXiv extensively lately as part of the ACUMEN project, trying to figure out who and what can be found there. The place is a bit of a mess - it's not Pubmed - but it still left me in awe, because not only did most of the astronomers I searched for have papers there, most of them had contributed at least one of the papers themselves (you can see who submitted each paper). 

ArXiv comes with a service called SPIRES (now inSPIRE) which can tell you how many times a paper was cited, who's citing whom, and so forth. This way, it's possible to measure at least some of the impact of preprints (if you're a high-energy physicist). So not only does ArXiv make scientific communication faster, it also helps evaluate the impact of this kind of communication more accurately. 

Unfortunately, not everybody gives ArXiv the honor it deserves. Nielsen tells how, when he was writing the book, a physicist told him that Paul Ginsparg, ArXiv's creator, was wasting his talent on "collecting garbage", reflecting the disregard certain scientists have for "mere" tool builders. I don't know if this attitude is common in the scientific community, but it's discouraging nonetheless. 

Open Access can be problematic 
Citizen science isn't always all that - in the Polymath Project, there were people with good intentions but not much knowledge; their contributions didn't have much value to the project and essentially had to be filtered out. 

Misinformation - premature publications, especially in fields the mainstream media takes an interest in, can spread far and wide, confusing the general public and discrediting research projects in its eyes. 

How we can be more open (if you're reading this, you probably don't need these suggestions). 
In the last few pages of the book, Nielsen suggests practical steps toward open science. A scientist can upload old data, code, etc. online for reuse (be sure to tell people how to cite it!); he or she can start a blog, contribute to other people's open science projects, or try to create a new one. Nielsen advises us to "be generous in giving other scientists credit when they share their scientific knowledge in new ways", which I think is excellent advice, even though formatting and style guides are a bit behind the times when it comes to social media. 

All in all, Reinventing Discovery is a great book; however, I was a little disappointed to find only a small section dedicated to science blogs. The author explains that he has had enough of the hype around blogging and that he doesn't want "to cover that well-trodden ground again", but I think the book could have benefited from a few more pages on the subject (yes, I know I'm not very objective here...). Also, though the book deals with - and recommends - open access, it isn't under a Creative Commons licence (you can read why here). 

Nielsen, Michael (2011). Reinventing Discovery. Princeton University Press. ISBN: 9780691148908

Wednesday, December 7, 2011

Reinventing Discovery: Book Review, Part I

In Arthur C. Clarke's story "Into the Comet", he describes a spaceship with a computer malfunction that dooms all aboard to eventual death by starvation or oxygen deprivation, whichever comes first. The solution is a device older than the computer: the abacus. The entire crew runs calculations on abaci, and they make their way out of the comet's nucleus successfully. That is an extreme example of citizen science (or oh-my-God-we're-all-going-to-die science), but it illustrates the principle that collaboration by a large number of people can solve very complicated problems. Michael Nielsen's excellent book, 'Reinventing Discovery', tells us about many such examples, though in most of them participants have to do a lot more than just calculate without thinking.

Take 'Galaxy Zoo': volunteers help classify galaxies (it turns out people do it faster and more accurately than a computer). It all began when one overworked grad student, Kevin Schawinski, wanted to prove that elliptical galaxies aren't always old, but had simply too many galaxies to go through to prove his theory. He and a post-doc, Chris Lintott, joined forces and opened a website that allowed anyone to come and classify galaxy photos. The project is an enormous success, with 22 scientific papers so far and the spin-offs Galaxy Zoo 2 and Galaxy Zoo: Hubble.

Another story Nielsen recounts is that of the Polymath Project: Fields Medal recipient Tim Gowers posted a mathematical problem on his blog and asked for a collaborative effort. Twenty-seven people wrote 800 comments and solved the problem within 37 days. Now there is a Polymath blog which keeps up the good work.

These projects were a success, but Nielsen also studies failed projects and the reasons for their failure. He argues (and I wholly agree!) that scientists are rewarded for writing as many good scientific papers as possible. Contributing to, say, Wikipedia essentially takes time away from research and gives nothing in terms of academic reputation.

Galaxy Zoo is a success because it gives astronomers something to write about, and it's possible the Polymath project succeeds because it A. involves people with tenure and B. involves people who want to be noticed by people with tenure.

Personally, I think the solution to scientists' reluctance to cooperate in collaborative projects is simple: put them in a spaceship and tell them they won't be able to make it home until they collaborate. However, it is possible the oxygen would run out while they argued about whose name gets to be first on the authors' list. Also, spaceships are very costly.

Next part: what Nielsen has to say about Arxiv and the future of open science.

Michael Nielsen talks Open Science in a TED event:

Nielsen, Michael (2011). Reinventing Discovery. Princeton University Press. ISBN: 9780691148908

Thursday, August 18, 2011

Generic drug trials: more transparency needed

The New York Times reported a couple of days ago that "Federal regulators and the generic drug industry are putting the final touches on an agreement that would help speed the approval of generic drugs in this country and increase inspections at foreign plants that export generic drugs and drug ingredients to the United States." The generic drug manufacturers will pay an annual fee of $299 million so that the FDA will be able to hire more reviewers and speed up the approval of applications for marketing generic drugs. The question is: what do we know about the generic drugs marketed today?

Van der Meersch et al. (2011) published in PLoS One a methodological systematic review of bioequivalence trials comparing generic to brand-name drugs, published between 2005 and 2008. They searched Medline for appropriate papers, as well as journals which regularly publish bioequivalence trials. Out of 134 papers that reported bioequivalence trials between a brand-name drug and a generic drug, 55 didn't include the reference drug's name and were excluded. The final sample consisted of 79 papers which assessed the bioequivalence of generic and brand-name drugs.

What do the FDA and the European Medicines Agency (EMA) demand from a generic drug?
The FDA wants to know three things:

Cmax - maximum plasma drug concentration
Tmax - time required to achieve a maximal concentration
AUC - total area under the plasma drug concentration-time curve

The 90% confidence intervals for the ratios (test:reference) have to be between 80% and 125%. The EMA wants to know only the Cmax and the AUC.
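The 80-125% rule can be made concrete with a small sketch: for a paired (crossover) design, one computes the 90% CI of the geometric-mean test:reference ratio on the log scale and checks that it lies within the bounds. This is a simplified illustration of the standard approach, not the FDA's full procedure, and the sample numbers are hypothetical:

```python
import numpy as np
from scipy import stats

def bioequivalence_90ci(test, ref, alpha=0.10):
    """90% CI of the geometric-mean ratio for paired (crossover)
    measurements (e.g. AUC or Cmax), plus the 80-125% check."""
    # Work on the log scale: the ratio of geometric means becomes
    # a mean of paired differences, suitable for a t-interval.
    d = np.log(np.asarray(test, float)) - np.log(np.asarray(ref, float))
    n = d.size
    se = d.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    lo = np.exp(d.mean() - t_crit * se)
    hi = np.exp(d.mean() + t_crit * se)
    return lo, hi, bool(lo >= 0.80 and hi <= 1.25)
```

With well-matched hypothetical AUC values the interval hugs 1.0 and the check passes; with noisy or shifted data it fails, which is what a failed bioequivalence trial reports.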

Bioequivalence experiments are usually randomized crossover trials, conducted on healthy volunteers by administering one dose of the drug. Seventy-three (92%) of the trials were indeed single-dose trials (6, or 8%, were multiple-dose), and 89% of the single-dose trials reported bioequivalence. About a third didn't report CIs for all the FDA criteria, and 20% didn't report the required EMA criteria. Only 41% of the papers reported funding; 25% had private funding.

As always, the study has limitations: it included only papers from the years 2005-2008 and relied on FDA guidelines from 2003 and EMA guidelines from 2001 (updated 2008). It's also possible that the researchers' search in Pubmed didn't retrieve all the relevant papers.

In conclusion, there is a serious lack of available data about generic drugs. The authors point out that while 1,661 generic drugs were approved by the FDA during the study period, no data were available about trials assessing generic drugs on the FDA and/or EMA sites. The authors also note that such a small percentage (10%) of failed bioequivalence trials seems unlikely, and suggest the possibility of publication bias.

van der Meersch, A., Dechartres, A., & Ravaud, P. (2011). Quality of reporting of bioequivalence trials comparing generic to brand name drugs: a methodological systematic review. PLoS One. DOI: 10.1371/journal.pone.0023611