Linear Regression Analysis of Title Word Count and Article Time Cited using R

Mohebbi and Douzandegan: Linear Regression Analysis of Title Word Count and Article Time Cited using R

INTRODUCTION

The effect of article title length on article impact is controversial. Studies have shown that title length may have a positive, negative, or neutral influence on article citations.[1,2,3] However, many factors may affect a study’s outcome, chiefly sample size, statistical methods, the journals or topics from which articles are retrieved, and the time span covered. For example, popular journals that attract more attention may bias results when they require article titles to follow a pre-defined format.[1]

Studies on this subject have reached different results because of differences in data selection and statistical analysis strategies. In a study of more than 9000 articles from 22 different journals, the authors concluded that articles in journals with higher impact factors tend to have longer titles and receive more citations.[5] In a later study, the authors examined more title variables in 423 articles divided into two groups, those with results-describing titles and those with methods-describing titles. Using several statistical analyses, including logistic regression, they showed that titles with fewer characters attract more citations.[12] Results similar to those of Paiva et al. were obtained in a study of seven PLoS journals, which found that, although each journal had a different scope, articles with short titles had more downloads and citations.[2]

These studies clearly show that study design may lead to either a positive or a negative correlation between title length and citations. Therefore, to minimize bias due to differing data sources, in this paper we selected a dataset from a uniform topic and research area within a defined time period.

R is a free statistical tool with over 2,000 cutting-edge, user-contributed packages available on CRAN. We preferred R to other statistical tools because, in addition to being freely available, it offers routinely updated advanced packages that incorporate recent developments in mathematics, which makes it a comprehensive tool for different types of data analysis; it provides data presentation packages; and it can import and analyze various data formats.

As a result, it was found that article title length has a potential impact on article citations. In addition, it was concluded that, along with title length, other scientometric variables may influence article citations.

MATERIALS AND METHODS

Data Retrieval and Article Title Word/Character Counting

One thousand scientific article records in the virology research area (SU=Virology), published from 1997 to 2016, were retrieved from the Web of Science InCite database on September 27, 2016. The data were then merged into CSV file format using the Publish or Perish 4.0 software package (Harzing, A.W. 2007, http://www.harzing.com/pop.htm). Microsoft (MS) Excel formulae were used for data manipulation and title word counting.
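The Excel formulae used for counting are not reproduced in the paper. As a rough illustration only, the same counts can be sketched in Python, assuming that a word is any whitespace-separated token. The function names are hypothetical, and this is not the authors' Excel or R workflow.

```python
def title_word_count(title: str) -> int:
    """Return the number of whitespace-separated tokens in a title."""
    # str.split() with no argument collapses runs of whitespace,
    # so leading, trailing, and repeated spaces do not inflate the count.
    return len(title.split())


def title_char_count(title: str) -> int:
    """Return the number of characters, ignoring surrounding whitespace."""
    return len(title.strip())


example = "Linear Regression Analysis of Title Word Count and Article Time Cited using R"
print(title_word_count(example))  # 13
```

Under a different tokenization rule (for example, splitting hyphenated terms), the counts would differ, so the rule should be fixed before titles are compared.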

The article database was cleaned of duplicates, and articles with missing data on any of the selected variables listed under Variable Selection were deleted.
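A minimal sketch of this cleaning step is given below in Python. The field names and the rule for deciding that two records are duplicates (same title, journal source, and year) are assumptions made for illustration; the paper does not state how duplicates were identified.

```python
# Hypothetical column names for the variables selected in this study.
REQUIRED_FIELDS = ("title", "twc", "yop", "publisher", "js", "tc")


def clean_records(records):
    """Drop duplicate records and records missing any required field."""
    seen = set()
    cleaned = []
    for rec in records:
        # A record is incomplete if a required field is absent, None, or blank.
        if any(rec.get(f) is None or str(rec.get(f)).strip() == ""
               for f in REQUIRED_FIELDS):
            continue
        # Assumption: records match when title, journal source, and year match.
        key = (str(rec["title"]).strip().lower(),
               str(rec["js"]).strip().lower(),
               rec["yop"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```

Note that a citation count of zero is a valid value and is kept; only absent or blank fields cause a record to be dropped.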

Variable Selection

Information on the following variables was then tabulated from the above articles database:

Article title word count (TWC), year of publication (YoP), publisher, journal source (JS) in which the article was published, and number of article citations (TC), retrieved from Science In Cite as on September 27, 2016. Journals (JS) were grouped into high or low impact factor using Web of Science Journal Citation Reports®. Data mining was performed in less than one hour.

Statistical Analysis

We chose a sufficiently high number of subjects per variable (SPV) to prevent R2 biases.[2] R statistical software was used for data analysis.[4] The packages lmtest, rms, Hmisc, and ggplot2 were used, respectively, to diagnose heteroscedasticity of the regression model, correct standard errors, compute correlations, and draw scatter plots.[6],[7],[8-13]

RESULTS

After data retrieval and trimming, 99,838 articles with the desired information remained. As shown in Figure 1, the articles came from 37 publishers and 56 journals. The mean citations and total number of articles by publisher and by journal source are shown in Tables 1 and 2. The 99,838 articles received 2,542,056 citations in total. Figure 1a shows that the American Society for Microbiology, which had the largest number of papers, had the most citations (40.55 ± 49.950) and TWCs, with the subject AIDS Research and Human Retroviruses having the largest total citations between 1997 and 2016.

Linear regression produced a model with a predictive ability of 13.68% (y-intercept = 62.325, slope = 0.545, adjusted R-squared = 0.1368, p-value <2.2e-16). All predictors were significant in the regression model with the same p-value (<2.2e-16), except TWC, whose p-value was 1.12e-06. This model, with the three predictors TWC, YoP, and Publisher, was better than models based on each variable alone: adjusted R-squared was 0.1228 with the Year predictor, 0 with the Journal Source variable, 0.021 with the Publisher predictor, and 0 with TWC. Together, however, the predictors TWC (p-value = 0.012), Year (p-value <2.2e-16), and Publisher (p-value <2.2e-16) have a synergistic potential to predict TC (Adj. R2 = 0.1368). Removing Journal Source from the model changed the predictive power only negligibly (Adj. R2 = 0.1357, p-value <2.2e-16).

Figure 3 shows diagnostic plots of the fitted model. There was no multicollinearity among the predictors. Heteroscedasticity was evaluated with the studentized Breusch-Pagan test, which showed unacceptable heteroscedasticity (BP = 527.89 [df: 4], p-value <2.2e-16). Standard errors were corrected to account for this. After the correction, the R-squared value was 0.306 (y-intercept: 4.19, S.E.: 0.02, p-value <0.0001).
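To make the quantities reported above concrete, the following Python sketch fits a least-squares line with a single predictor, computes the adjusted R-squared, and computes the studentized (Koenker) form of the Breusch-Pagan statistic. It is a simplified illustration under stated assumptions, not the authors' R analysis: the model in this study has several predictors, whereas the sketch handles one, and the toy data are invented.

```python
import math


def ols_fit(x, y):
    """Fit y = intercept + slope * x by least squares; return (intercept, slope, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    # If y is constant there is no variance to explain; report R2 as 0.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return intercept, slope, r2


def adjusted_r2(r2, n, p):
    """Adjusted R2 for n observations and p predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def studentized_bp(x, y):
    """Studentized Breusch-Pagan test for a one-predictor model.

    Returns (statistic, p_value); the statistic has 1 degree of freedom.
    """
    intercept, slope, _ = ols_fit(x, y)
    squared_resid = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression of the squared residuals on the predictor.
    _, _, aux_r2 = ols_fit(x, squared_resid)
    # Clamp at zero to absorb tiny negative values from rounding error.
    stat = max(0.0, len(x) * aux_r2)
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Invented toy data: title word counts and citation counts.
toy_twc = [5, 8, 10, 12, 15, 18, 20, 25]
toy_tc = [30, 22, 35, 18, 40, 12, 28, 20]
intercept, slope, r2 = ols_fit(toy_twc, toy_tc)
print(intercept, slope, adjusted_r2(r2, len(toy_twc), 1))
print(studentized_bp(toy_twc, toy_tc))
```

A small p-value from the test indicates that the residual variance depends on the predictor, which is the situation that motivates correcting the standard errors.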

Figure 4 shows observed vs. predicted values. When the data were modeled by journal source, R2 was significantly higher, but only for journals with few articles. The prediction equations obtained for these journal sources were not able to predict the observed citations (data not shown). We hypothesized that the significant negative correlation between citations and Publisher could result from the inclusion of a large number of journals with low impact factors (IF) in the dataset. To test this, the data were split into four quartile categories (Cat. 1-4) of journals with IF less than 1.5, 1.501-2.45, 2.451-4.1, and greater than 4.11. Moreover, the data were split uniformly within each IF category based on the source of publication. As illustrated in Figure 2, high impact journals have more citations, as expected. Contrary to our hypothesis, a large proportion of the articles fell into the categories with higher IF.
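The assignment of a journal to an IF category can be sketched as follows in Python. The cut points are those quoted above; how a value lying exactly on a cut point is handled is an assumption, since the stated ranges leave small gaps (for example, between 4.1 and 4.11).

```python
from bisect import bisect_left

# Upper bounds of Cat. 1-3 as quoted in the text; Cat. 4 is everything above.
IF_UPPER_BOUNDS = [1.5, 2.45, 4.1]


def impact_factor_category(impact_factor: float) -> int:
    """Return the category (1-4) for a journal impact factor."""
    # Assumption: a value exactly on a bound falls in the lower category.
    return bisect_left(IF_UPPER_BOUNDS, impact_factor) + 1


print(impact_factor_category(3.2))  # 3
```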

Table 1

Information of data based on publishers.

Publisher | Citations (Mean ± S.D.) | TWC (Mean ± S.D.) | Year^a | Number of Articles | R2 (p-value)^b
ACADEMIC PRESS | 26.08 ± 33.275 | 16.10 ± 5.265 | 1997-2016 | 9180 | 0 (0.1)
AEPRESS | 2.13 ± 3.623 | 15.08 ± 4.344 | 1998-2015 | 256 | 0.006 (0.1)
AMER SOC MICROBIOLOGY | 40.55 ± 49.950 | 16.73 ± 5.371 | 1997-2016 | 25699 | 0 (0.5)
ANNUAL REVIEWS | 3.57 ± 4.233 | 8.32 ± 3.369 | 2014-2015 | 56 | 0 (0.8)
AOSIS OPEN JOURNALS | 0.43 ± 1.042 | 12.65 ± 5.117 | 2008-2015 | 37 | 0 (0.5)
BENTHAM SCIENCE PUBL LTD | 7.84 ± 10.574 | 13.70 ± 13.70 | 2003-2016 | 649 | 0.015 (0.01)
BIOMED CENTRAL LTD | 11.39 ± 15.280 | 15.54 ± 5.047 | 2004-2016 | 3834 | 0.004 (0.0001)
BLACKWELL | 26.40 ± 28.365 | 16.18 ± 5.314 | 1997-2008 | 792 | 0 (0.3)
CELL PRESS | 50.21 ± 63.049 | 13.79 ± 3.463 | 2007-2016 | 876 | 0 (0.8)
EDITIONS SCIENTIFIQUES MEDICALES ELSE | 10.72 ± 10.792 | 13.25 ± 4.149 | 1997-1998 | 85 | 0 (0.7)
ELSEVIER | 15.23 ± 21.274 | 15.29 ± 5.102 | 1997-2016 | 15001 | 0.005 (0.0001)
FUTURE MEDICINE LTD | 1.73 ± 3.356 | 11.60 ± 4.267 | 2006-2016 | 275 | 0.003 (0.3)
GUSTAV FISCHER VERLAG | 9.17 ± 13.644 | 13.84 ± 5.768 | 1997-2000 | 264 | 0 (0.4)
HINDAWI PUBLISHING CORP | 0.24 ± 0.437 | 14.47 ± 2.348 | 2015-2016 | 17 | 0 (0.6)
INDIAN VIROLOGICAL SOC | 1.97 ± 2.201 | 15.15 ± 4.426 | 2005-2010 | 79 | 0 (0.7)
INT MEDICAL PRESS | 17.28 ± 21.244 | 15.72 ± 4.493 | 1997-2015 | 1787 | 0 (0.6)
JOHN WILEY & SONS LTD | 27.64 ± 38.136 | 8.77 ± 4.810 | 1997-2009 | 47 | 0.010 (0.2)
KARGER | 11.77 ± 23.213 | 14.96 ± 5.654 | 1997-2015 | 971 | 0 (0.4)
KLUWER ACADEMIC | 12.63 ± 14.908 | 14.85 ± 5.389 | 1998-2004 | 546 | 0 (0.6)
LIPPINCOTT WILLIAMS & WILKINS | 37.40 ± 52.752 | 14.75 ± 4.205 | 1997-2016 | 6224 | 0 (0.4)
MARY ANN LIEBERT | 14.12 ± 19.250 | 16.80 ± 5.166 | 1997-2016 | 4567 | 0 (0.9)
MDPI AG | 3.62 ± 5.293 | 14.56 ± 4.883 | 2005-2016 | 683 | 0.009 (0.008)
MICROBIOLOGY SOC | 3.50 ± 3.536 | 18.50 ± 7.778 | 2013, 2015 | 2 | -
NATURE PUBLISHING GROUP | 27.69 ± 24.455 | 13.53 ± 5.193 | 2000-2001 | 127 | 0.005 (0.2)
NEW YORK ACAD SCIENCES | 39.59 ± 33.431 | 11.15 ± 4.258 | 2001 | 39 | 0.13 (0.014)
PLENUM PRESS DIV PLENUM PUBLISHING CO | 9.66 ± 16.864 | 13.28 ± 5.102 | 1998 | 106 | 0.017 (0.1)
PUBLIC LIBRARY SCIENCE | 31.42 ± 38.978 | 14.19 ± 4.139 | 2005-2016 | 4969 | 0.006 (0.0001)
RAPID SCIENCE PUBLISHERS | 52.66 ± 58.820 | 14.46 ± 5.277 | 1997 | 215 | 0 (0.7)
SA HIV CLINICIANS SOC | 1.43 ± 5.222 | 12.12 ± 5.069 | 2007-2014 | 165 | 0 (0.9)
SLOVAK ACADEMIC PRESS LTD | 6.89 ± 7.477 | 14.97 ± 4.770 | 1997-2011 | 530 | 0.004 (0.08)
SOC GENERAL MICROBIOLOGY | 25.72 ± 31.643 | 16.29 ± 5.348 | 1997-2016 | 6762 | 0.001 (0.017)
SPRINGER | 10.95 ± 21.797 | 15.21 ± 4.982 | 1997-2016 | 7234 | 0 (0.5)
STOCKTON PRESS | 25.49 ± 36.054 | 14.39 ± 5.866 | 1997-2000 | 183 | 0.007 (0.1)
TAYLOR & FRANCIS INC | 21.55 ± 26.840 | 15.62 ± 5.488 | 2001-2010 | 506 | 0 (0.6)
URBAN & FISCHER VERLAG | 26.14 ± 31.251 | 13.55 ± 5.563 | 2000-2005 | 288 | 0.014 (0.024)
WILEY-BLACKWELL | 16.21 ± 22.819 | 15.94 ± 4.836 | 1997-2016 | 6775 | 0.001 (0.005)
WORLD HEALTH ORGANIZATION | 0.00 ± 0.000 | | 2013 | 12 | -
Total | 25.46 ± 38.006 | 15.77 ± 5.150 | - | 99838 |

a Year in which data are published.

b Adjusted-R2 of linear regression analysis of TC and TWC based on publisher.

The correlation between TC and the other parameters was investigated. The results showed negative correlations of TC with YoP (−0.35, p=0.0001) and Publisher (−0.14, p=0.0001), a positive correlation with TWC (0.01, p=0.0121), and no correlation with journal source (0, p>0.05) (Figure 5).
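For reference, a correlation coefficient of the kind reported above can be computed as in the Python sketch below. It assumes Pearson's r; the text does not state which coefficient was used, and the example values are invented.

```python
import math


def pearson_r(x, y):
    """Return Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


# Invented example: later publication years paired with fewer citations.
years = [1997, 2001, 2005, 2009, 2013, 2016]
citations = [60, 45, 38, 20, 9, 2]
print(pearson_r(years, citations))
```

A negative coefficient between YoP and TC is expected on general grounds, because recently published articles have had less time to accumulate citations.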

DISCUSSION AND CONCLUSION

A p-value less than 0.05 is considered sufficient for including a variable in a predictive linear model. The linear regression results obtained here also indicate an effect of TWC on the response variable, TC. In this paper, however, we examined in detail whether a TWC-based linear model for predicting TC is reliable.

We conducted a linear regression analysis on a database of virology papers. Interestingly, using the TWC variable, we found that a linear model can be fitted for data sets that contain a small number of articles with low TC (Table 2). However, the results do not show a reliable linear model for predicting TC irrespective of the number of articles and for high TC. It is likely that, for articles that receive more citations, readers pay attention to many more variables than TWC alone, making it harder to fit a regression model.

Having checked the relationship between TWC and TC and found no linear relation (only 30.6% predictive ability with standard error corrections), we then incorporated YoP (year of publication) and JS (journal source) in addition to TWC (title word count) and searched for meaningful predictors of TC (article times cited). We found that TC is negatively correlated with YoP and JS (Publisher) and positively correlated with TWC (P<0.05). The negative correlation of JS and TC is shown in Figure 2; thus, the TC of articles in high impact factor journals during 1997-2016 is less predictable.

We note that scientometric and bibliometric studies employ varied methods of data collection and analysis. However, a scientific paper also has descriptive and reflective content.[1] Falahati et al. observed that title length and article subject are both relevant to article citations, but they did not find a correlation between title length and citations,[3] implying that other bibliometric factors may be involved. Article citations may be influenced by research area, topic, word count, characters, punctuation, etc. Also, some topics may attract more interest than others in a certain time period. Therefore, analysis based on different time period segments may minimize biases arising either from the variety of published articles or from the time variable itself. For this, other methods or data retrieval strategies are needed.

Table 2

Information of data based on journal sources.

Journal Source | Citations (Mean ± S.D.) | TWC (Mean ± S.D.) | Year^a | Number of Articles | R2 (p-value)^b
ACTA VIROLOGICA | 5.34 ± 6.850 | 15.01 ± 4.633 | 1997-2015 | 786 | 0.005 (0.031)
ADVANCES IN VIROLOGY | 0.24 ± 0.437 | 14.47 ± 2.348 | 2015-2016 | 17 | 0 (0.6)
ADVANCES IN VIRUS RESEARCH | 0.00 | 7.00 | 2000 | 1 | -
AIDS | 38.20 ± 53.328 | 14.70 ± 4.209 | 1997-2016 | 6345 | 0 (0.2)
AIDS RESEARCH AND HUMAN RETROVIRUSES | 15.12 ± 20.425 | 16.95 ± 5.218 | 1997-2016 | 3680 | 0 (0.8)
ANNUAL REVIEW OF VIROLOGY | 3.57 ± 4.233 | 8.32 ± 3.369 | 2014-2015 | 56 | 0 (0.8)
ANTIVIRAL CHEMISTRY & CHEMOTHERAPY | 15.57 ± 17.729 | 14.83 ± 5.156 | 1997-2001 | 184 | 0.001 (0.3)
ANTIVIRAL CHEMISTRY & CHEMOTHERAPY CLINICAL A | 6.96 ± 9.998 | 6.78 ± 3.384 | 1999 | 23 | 0.084 (0.1)
ANTIVIRAL RESEARCH | 17.57 ± 22.755 | 15.23 ± 4.826 | 1997-2016 | 2027 | 0.009 (0.0001)
ANTIVIRAL THERAPY | 17.47 ± 21.607 | 15.82 ± 4.401 | 1998-2015 | 1603 | 0 (0.8)
ARCHIVES OF VIROLOGY | 12.70 ± 24.843 | 15.45 ± 5.053 | 1998-2016 | 4894 | 0 (0.2)
BIOLOGY OF EMERGING VIRUSES: SARS, AVIAN | 28.20 ± 24.917 | 9.20 ± 5.534 | 2007 | 10 | 0.372 (0.036)
BULLETIN DE L INSTITUT PASTEUR | 3.00 ± 4.359 | 10.67 ± 2.517 | 1997-1998 | 3 | 0 (0.9)
CELL HOST & MICROBE | 50.21 ± 63.049 | 13.79 ± 3.463 | 2007-2016 | 876 | 0 (0.8)
CLINICAL AND DIAGNOSTIC VIROLOGY | 21.46 ± 20.877 | 15.01 ± 4.930 | 1997-1998 | 70 | 0.005 (0.2)
CORONAVIRUSES AND ARTERIVIRUSES | 9.66 ± 16.864 | 13.28 ± 5.102 | 1998 | 106 | 0.017 (0.1)
CURRENT HIV RESEARCH | 7.84 ± 10.574 | 13.70 ± 4.776 | 2003-2016 | 649 | 0.015 (0.001)
CURRENT OPINION IN VIROLOGY | 13.79 ± 17.029 | 8.89 ± 3.191 | 2011-2016 | 513 | 0.003 (0.1)
FOOD AND ENVIRONMENTAL VIROLOGY | 5.48 ± 9.023 | 14.08 ± 4.589 | 2009-2016 | 217 | 0.015 (0.039)
FUTURE VIROLOGY | 1.73 ± 3.356 | 11.60 ± 4.267 | 2006-2016 | 275 | 0.003 (0.2)
GASTROENTERITIS VIRUSES | 25.00 ± 28.154 | 6.29 ± 3.312 | 2001 | 17 | 0 (0.9)
HIV INTERACTIONS WITH DENDRITIC CELLS: INFECT | 6.70 ± 4.855 | 8.80 ± 2.044 | 2013 | 10 | 0 (0.7)
INDIAN JOURNAL OF VIROLOGY | 1.70 ± 1.914 | 15.69 ± 4.664 | 2009-2013 | 162 | 0 (1)
INFLUENZA AND OTHER RESPIRATORY VIRUSES | 6.58 ± 11.464 | 15.31 ± 4.864 | 2009-2016 | 742 | 0 (0.5)
INTERNATIONAL JOURNAL OF MEDICAL MICROBIOLOGY | 16.32 ± 22.870 | 14.52 ± 5.067 | 2000-2016 | 1056 | 0.013 (0.0001)
INTERVIROLOGY | 11.69 ± 23.543 | 15.11 ± 5.639 | 1997-2015 | 934 | 0 (0.4)
JAAGSIEKTE SHEEP RETROVIRUS AND LUNG CANCER | 32.50 ± 12.581 | 9.75 ± 4.950 | 2003 | 8 | 0.474 (0.035)
JOURNAL OF CLINICAL VIROLOGY | 16.17 ± 23.983 | 15.22 ± 5.193 | 1998-2016 | 3087 | 0.008 (0.0001)
JOURNAL OF GENERAL VIROLOGY | 25.72 ± 31.640 | 16.29 ± 5.348 | 1997-2016 | 6764 | 0.01 (0.017)
JOURNAL OF HUMAN VIROLOGY | 18.31 ± 17.149 | 17.06 ± 5.740 | 1999-2002 | 94 | 0 (0.5)
JOURNAL OF MEDICAL VIROLOGY | 18.58 ± 24.732 | 15.83 ± 4.842 | 1997-2016 | 4999 | 0.001 (0.007)
JOURNAL OF NEUROVIROLOGY | 18.48 ± 27.584 | 14.79 ± 5.419 | 1997-2016 | 1222 | 0.001 (0.2)
JOURNAL OF VIRAL HEPATITIS | 18.04 ± 23.277 | 16.67 ± 4.923 | 1997-2016 | 1810 | 0.002 (0.033)
JOURNAL OF VIROLOGICAL METHODS | 14.33 ± 20.025 | 16.10 ± 4.807 | 1997-2016 | 4791 | 0.001 (0.016)
JOURNAL OF VIROLOGY | 40.55 ± 49.950 | 16.73 ± 5.371 | 1997-2016 | 25699 | 0 (0.5)
NIDOVIRUSES (CORONAVIRUSES AND ARTERIVIRUSES) | 3.92 ± 5.611 | 13.53 ± 4.643 | 2001 | 102 | 0 (0.6)
NIDOVIRUSES: TOWARD CONTROL OF SARS AND OTHER | 3.74 ± 3.726 | 10.67 ± 3.836 | 2006 | 111 | 0 (0.5)
PLOS PATHOGENS | 31.42 ± 38.978 | 14.19 ± 4.139 | 2005-2016 | 4969 | 0.006 (0.0001)
POLYOMAVIRUSES AND HUMAN DISEASES | 25.96 ± 25.740 | 8.38 ± 3.487 | 2006 | 24 | 0.078 (0.1)
RESEARCH IN VIROLOGY | 11.45 ± 12.727 | 13.07 ± 4.258 | 1997-1998 | 89 | 0 (0.4)
RESPIRATORY VIROLOGY AND IMMUNOGENICITY | 0.88 ± 0.991 | 12.50 ± 3.625 | 2015 | 8 | 0.535 (0.024)
RETROVIROLOGY | 16.40 ± 19.993 | 15.34 ± 5.049 | 2005-2016 | 1061 | 0.004 (0.029)
REVIEWS IN MEDICAL VIROLOGY | 25.00 ± 40.541 | 10.14 ± 4.673 | 1997-2013 | 17 | 0.037 (0.1)
SEMINARS IN VIROLOGY | 22.00 ± 17.297 | 8.19 ± 3.731 | 1997-1998 | 26 | 0.195 (0.014)
SIMIAN VIRUS 40 (SV40): POSSIBLE HUMAN POLYOM | 13.92 ± 12.203 | 11.30 ± 4.795 | 1998 | 37 | 0.012 (0.2)
SOUTHERN AFRICAN JOURNAL OF HIV MEDICINE | 1.25 ± 4.753 | 12.22 ± 5.069 | 2007-2015 | 202 | 0 (0.9)
VIRAL IMMUNOLOGY | 9.95 ± 12.479 | 16.18 ± 4.897 | 1998-2016 | 887 | 0 (0.3)
VIROLOGICA SINICA | 0.54 ± 0.691 | 14.68 ± 2.897 | 2015-2016 | 37 | 0.063 (0.07)
VIROLOGY | 26.10 ± 33.310 | 16.12 ± 5.251 | 1997-2016 | 9153 | 0 (0.1)
VIROLOGY JOURNAL | 9.47 ± 12.519 | 15.61 ± 5.046 | 2004-2016 | 2773 | 0.003 (0.002)
VIRUS GENES | 9.28 ± 11.538 | 15.39 ± 4.742 | 1998-2016 | 1857 | 0 (0.2)
VIRUS RESEARCH | 14.95 ± 20.696 | 15.30 ± 5.171 | 1997-2016 | 3738 | 0.008 (0.0001)
VIRUSES-BASEL | 3.62 ± 5.293 | 14.56 ± 4.883 | 2009-2016 | 683 | 0.009 (0.008)
WEST NILE VIRUS: DETECTION, SURVEILLANCE, AND | 39.59 ± 33.431 | 11.15 ± 4.258 | 2001 | 39 | 0.13 (0.014)
WHO EXPERT CONSULTATION ON RABIES: SECOND REP | 0.00 ± 0.000 | 4.75 ± 1.815 | 2013 | 12 | -
ZENTRALBLATT FUR BAKTERIOLOGIE-INTERNATIONAL | 9.17 ± 13.644 | 13.84 ± 5.768 | 1997-2000 | 264 | 0 (0.4)
Total | 25.46 ± 38.006 | 15.77 ± 5.150 | - | 99838 |

a Year in which data are published.

b Adjusted-R2 of linear regression analysis of TC and TWC based on journal source.

Figure 1

Schematic representation of TWC and TC based on publishers and journal sources. a) shows total TC for each publisher, b) illustrates TWCs within publishers, c) shows TC by journal source, and d) shows the TWCs used in each journal source.

https://s3-us-west-2.amazonaws.com/jourdata/jscires/10.5530jscires.6.1.3-g001.jpg
Figure 2

Journal sources with their respective scientometric information. Journal sources in Cat. 1 had a mean TC, number of articles, and TWC of 6.5, 424.67, and 13.59, respectively. In Cat. 2, these values were 10.45 for TC, 1907.36 for number of articles, and 14.68 for TWC. In Cat. 3 and Cat. 4, respectively, mean citations were 17.67 and 27.19, the mean number of articles was 2843.70 and 4596.20, and mean TWC was 15.30 and 13.55.

https://s3-us-west-2.amazonaws.com/jourdata/jscires/10.5530jscires.6.1.3-g002.jpg
Figure 3

Diagnostic plots of the adjusted linear model. The upper left plot shows residuals versus fitted values; the data follow the regression line, as the red line lies on the dotted line. The normal Q-Q plot shows that the residuals are normally distributed; outlying data with higher citations are shown as well. The lower left plot shows the square root of the standardized residuals against the fitted values; as the red line is flat, it is assumed that the variance of the residuals does not change across the distribution. The fourth plot shows how each data point influences the regression; outliers, that is, data with high leverage and large residuals (points beyond a Cook’s distance of 0.5), may affect the linear regression fit, and as the red line does not leave the dashed line, it indicates a good regression fit.

https://s3-us-west-2.amazonaws.com/jourdata/jscires/10.5530jscires.6.1.3-g003.jpg
Figure 4

Observed versus predicted values for a 1/100 random sample, predicted by the adjusted, standard-error-corrected linear model.

https://s3-us-west-2.amazonaws.com/jourdata/jscires/10.5530jscires.6.1.3-g004.jpg
Figure 5

Scatter plot matrix of data variables correlation.

https://s3-us-west-2.amazonaws.com/jourdata/jscires/10.5530jscires.6.1.3-g005.jpg

From a scientific point of view, the number of times an article is cited is a major measure of its impact. Therefore, finding the factors that influence article citation needs further research. Accordingly, the factors with a high impact on article times cited can be used to construct statistical predictive model(s).

Notes

Conflict of interest: The authors have no conflicts of interest to declare.

Funding sources: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

ACKNOWLEDGMENT

We thank Doctor Mohammad Ali Vakili for unsparing assistance in statistical concepts.

ABBREVIATION USED

TWC: Title Word Count
YoP: Year of Publication
JS: Journal Source
TC: Time Cited
SU: Subject Area
CSV: Comma Separated Value
MS: Microsoft
SPV: Subject Per Variable
IF: Impact Factor

REFERENCES

1. Aleixandre-Benavent R, Montalt-Resurecció V, Valderrama-Zurián JC. A descriptive study of inaccuracy in article titles on bibliometrics published in biomedical journals. Scientometrics. 2014;101(1):781–91. https://doi.org/10.1007/s11192-014-1296-5

2. Austin PC, Steyerberg EW. The number of subjects per variable required in linear regression analyses. Journal of Clinical Epidemiology. 2015;68(6):627–36. https://doi.org/10.1016/j.jclinepi.2014.12.014. PMID: 25704724.

3. Falahati Qadimi Fumani MR, Goltaji M, Parto P. The impact of title length and punctuation marks on article citations. Annals of Library and Information Studies (ALIS). 2015;62(3):126–32.

4. Fox J, Weisberg S, Adler D, Bates D, Baud-Bovy G, Ellison S, Firth D, Friendly M, Gorjanc G, Graves S, Heiberger R. R package ‘car’. 2016.

5. Habibzadeh F, Yadollahie M. Are shorter article titles more attractive for citations? Cross-sectional study of 22 scientific journals. Croatian Medical Journal. 2010;51(2):165–70. https://doi.org/10.3325/cmj.2010.51.165. PMID: 20401960; PMCID: PMC2859422.

6. Harrell FE Jr. Hmisc: Harrell miscellaneous. R package version 3.12-2 [Computer software]. 2013. Available from http://cran.r-project.org/web/packages/Hmisc

7. Harrell FE Jr. rms: Regression Modeling Strategies. R package version 4.0-0. 2013.

8. Hothorn T, Zeileis A, Farebrother RW, Cummins C, Millo G, Mitchell D, Zeileis MA. Package ‘lmtest’: Testing linear regression models. [Accessed 2015 Jun 6]. https://cran.r-project.org/web/packages/lmtest/lmtest.pdf

9. Jacques TS, Sebire NJ. The impact of article titles on citation hits: an analysis of general and specialist medical journals. JRSM Short Reports. 2010;1(1):2. https://doi.org/10.1258/shorts.2009.100020. PMID: 21103094; PMCID: PMC2984326.

10. Jamali HR, Nikzad M. Article title type and its relation with the number of downloads and citations. Scientometrics. 2011;88(2):653–61. https://doi.org/10.1007/s11192-011-0412-z

11. Letchford A, Moat HS, Preis T. The advantage of short paper titles. Royal Society Open Science. 2015;2(8):150266. https://doi.org/10.1098/rsos.150266. PMID: 26361556; PMCID: PMC4555861.

12. Paiva CE, Lima JP, Paiva BS. Articles with short titles describing the results are cited more often. Clinics. 2012;67(5):509–13. https://doi.org/10.6061/clinics/2012(05)17

13. Wickham H. ggplot2: elegant graphics for data analysis. New York: Springer; 2009. https://doi.org/10.1007/978-0-387-98141-3