Mapping World-class universities on the Web

Jose Luis Ortega, Isidro Aguillo

Cybermetrics Lab, IEDCYT-CSIC, Joaquín Costa, 22. 28002 Madrid. Spain
jortega(at)orgc.csic.es; isidro.aguillo(at)cchs.csic.es

Published in Information Processing & Management, 2009, 45(2) (March): 272-279

Abstract
The aim of this paper is to present a visual display of the most important universities in the world. It shows the topological characteristics of their network and describes the relationships among universities of different countries and continents. The first 1 000 higher education institutions from the Ranking Web of World Universities were selected and their link relationships were obtained from Yahoo! Search. Network graphs and geographical maps were built from the search engine data. Social Network Analysis techniques were used to analyse and describe the structural properties of the whole network and its nodes. The results show that the world-class university network is made up of national sub-networks that merge in a central core, where the principal universities of each country pull their networks toward international link relationships. The United States dominates the world network; within Europe, the British and German sub-networks stand out.

Introduction
The World Wide Web has become a key medium for promoting and developing the academic, scientific and educational competences of a university. E-learning programs and open access initiatives allow the knowledge of these institutions to spread beyond physical boundaries. The Web can hence be used to attract students, scholars and funding from other places, spreading the prestige of these educational institutions all over the world. This has provoked fierce competition between universities to achieve advantageous visibility on the Web and to improve their position in search engine results.
Web performance has been analysed from different points of view. Web data have been used as an indicator of the educational and scientific activity developed on the Web, relating web indicators to academic outputs (Thelwall, 2002b; Thelwall & Harries, 2003; 2004; Smith, 2008) or bibliometric indicators (Aguillo, Granadino, Ortega & Prieto, 2006). Information visualization (Chen, 2003) has also been a suitable tool for mapping university linkages and showing visual relationships according to several variables. The first attempts used multivariate analysis to plot and group universities (Polanco, Boudourides, Besagni & Roche, 2001; Vaughan, 2006). Network Analysis now offers additional structural and visual possibilities. Heimeriks & van den Besselaar (2006) used these techniques to detect four geographical zones in the European Union (EU) academic web space: Scandinavia, the UK, Germany and Southern Europe. Similar results were obtained by Ortega et al. (2008), who found that European universities are grouped in local or national sub-networks which are connected with other sub-networks for linguistic or geographical reasons (Thelwall, 2002a; Thelwall, Tang & Price, 2003). Lately, Thelwall and Zuccala (2008, in press) have studied the link relationships between universities and national web spaces in Europe, describing the European web relationships at the country level.
All these studies focused on countries such as Spain (Thelwall & Aguillo, 2003; Ortega & Aguillo, 2007) and Canada (Vaughan & Thelwall, 2005; Vaughan, 2006), or on regions such as the EU (Heimeriks & van den Besselaar, 2006; Ortega et al., 2008; Thelwall & Zuccala, 2008, in press) and Scandinavia (Ortega & Aguillo, 2008a). However, studying the performance of universities at a global level and with a large and consistent population has not been attempted.

Objectives
The purpose of this paper is to present a visual display of the 1 000 most important universities in the world according to the Ranking Web of World Universities (www.webometrics.info). This map is intended to show the topological characteristics of this network and to describe the relationships among universities of different countries and continents. Using network analysis techniques, we also identify the most important universities in the network structure, the gateway universities that connect different web spaces or sub-networks, and the network core.

Methods

Data extraction
We selected the first 1 000 higher education institutions from the Ranking Web of World Universities. This ranking orders universities according to three main web characteristics of their institutional web domain. The volume of content is measured by the number of freely accessible pages, and visibility by the number of incoming links. The number of rich files is used as a third indicator because rich files are a common format for disseminating scientific and technical data and results. Together, these indicators make it possible to describe the performance of these academic institutions on the Web and complement other educational and scientific rankings. The main search engines (Google, Yahoo! Search, Live Search and Exalead) are used to compile this ranking (Aguillo, Ortega & Fernandez, 2008).
We selected 1 000 institutions because a digital divide was perceived between North American universities and the rest of the world. North American universities make up 59.5% of the top 200 list and 40.36% of the top 500 list (Aguillo, Ortega & Fernandez, 2008). We therefore decided to take a wide sample that represents more continents.
A link matrix for this set of universities was built from data extracted from Yahoo! Search in February 2008. Yahoo! Search was used because it supports several search operators and its web coverage is rather wide. The following queries were used to obtain links from the university domain (A) to the university domain (B) and vice versa:

site:{university domain (A)} linkdomain:{university domain (B)}

and to obtain the total number of pages indexed in the university domain (A):

site:{university domain (A)}

A SQL routine was used to submit the 1 001 000 queries needed to build the link matrix.
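As an illustration of where the figure of 1 001 000 queries may come from, the following Python sketch generates both query types for a list of domains. It is not the authors' SQL routine: the function name build_queries is hypothetical, and the inclusion of self-pairs (A linking to A) is an assumption made only because 1 000 × 1 000 + 1 000 matches the reported total.

```python
from itertools import product

def build_queries(domains, include_self=True):
    """Return (link_queries, size_queries) for a list of web domains.

    include_self=True keeps the A -> A pairs; this is an assumption,
    chosen because it reproduces the query count reported in the text.
    """
    link_queries = [
        f"site:{a} linkdomain:{b}"
        for a, b in product(domains, repeat=2)
        if include_self or a != b
    ]
    size_queries = [f"site:{a}" for a in domains]
    return link_queries, size_queries

# Three domains from the sample, used only to show the query shape.
links, sizes = build_queries(["mit.edu", "cam.ac.uk", "ethz.ch"])
print(links[1])                 # site:mit.edu linkdomain:cam.ac.uk
print(len(links) + len(sizes))  # 3*3 + 3 = 12

# For n = 1000 domains the same count is n*n + n.
n = 1000
total_queries = n * n + n
print(total_queries)            # 1001000
```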

Geographical Map
We built a geographical map to show the distribution of pages and link flows at the country level. Designing a geographical map requires a base map containing the political boundaries of the world. This base map was downloaded from the Blue Marble Geographics web site (www.bluemarblegeo.com). We then used the Geographical Information System (GIS) software MapViewer 6 to build the final map. This map has two layers: a hatch map which represents the number of web pages by country and a flow map which shows the links between countries. The classification method used in both layers was Jenks' natural breaks (Jenks, 1963). This method determines the best arrangement of values into classes by iteratively comparing the sums of the squared differences between the observed values within each class and the class means. It improves the visualization and interpretation of the results because it creates more significant differences between classes.
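The classification criterion described above can be sketched in Python as follows. This is a minimal illustration, not MapViewer's implementation: instead of an iterative procedure it searches every contiguous split of the sorted values, which is only practical for small inputs, and the function names are hypothetical. The sample values are the country shares of web pages reported in the Results section.

```python
from itertools import combinations

def sum_squared_dev(values):
    """Sum of squared differences between the values and their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def natural_breaks(values, n_classes):
    """Find the contiguous classes with minimal within-class squared deviation.

    Exhaustive search over all cut positions (an assumption made for
    clarity; it optimizes the same criterion as Jenks' method).
    """
    data = sorted(values)
    best_cost, best_classes = float("inf"), None
    # Choose n_classes - 1 cut positions between consecutive sorted values.
    for cuts in combinations(range(1, len(data)), n_classes - 1):
        bounds = (0, *cuts, len(data))
        classes = [data[i:j] for i, j in zip(bounds, bounds[1:])]
        cost = sum(sum_squared_dev(c) for c in classes)
        if cost < best_cost:
            best_cost, best_classes = cost, classes
    return best_classes

# Percentage of web pages for the USA, Germany, the UK, Japan,
# Australia, China and Brazil, as quoted in the Results section.
shares = [50.57, 7.14, 4.28, 2.35, 2.35, 2.33, 0.94]
for group in natural_breaks(shares, 3):
    print(group)
```

With three classes, the largest share ends up in a class of its own, which is the kind of separation between classes that the method is meant to produce.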

Network Graph
A network graph was built from the in-links between the 1 000 university web domains. Several variables were used to add information about the network configuration. Node size shows the volume of web pages that each university publishes on the Web, colours represent the nationality of each higher education institution and arc width shows the frequency of links between two university domains.
The software used to visualise the network was Pajek 1.02. We selected a cut-off of at least 50 links to improve the network visualization. We also used the Fruchterman-Reingold algorithm to lay out the network because it is fast for large networks (de Nooy, Mrvar & Batagelj, 2005).
Several social network indicators were used to describe the network topology and the main characteristics of the nodes:

  • K-Core: a sub-network in which each node has at least degree k. K-Cores allow us to detect groups with a strong link density. In scale-free networks such as the Web, the core with the highest degree is the central core of the network, identifying the set of nodes the network rests on (Seidman, 1983).
  • Degree: the number of lines connecting a node. This can be normalized (nDegree) by the total number of nodes in the network. In a directed network such as the Web we can count only the incoming links (InDegree) or the outgoing links (OutDegree). In Webometrics, InDegree allows us to detect the visibility of a web domain (Cothey, 2005; Kretschmer & Kretschmer, 2006).
  • Betweenness: the capacity of one node to help connect those nodes that are not directly connected to each other. Its normalization is the percentage over the total number of nodes in the network. From a webometric point of view, this measure allows us to detect hubs or gateways that connect different web networks (Faba-Pérez, Zapico-Alonso, Guerrero-Bote & Moya-Anegón, 2005).
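The first two indicators can be illustrated with a small Python sketch. It is not Pajek's implementation: the function names and the toy arcs are hypothetical, the k-core computation ignores arc direction (an assumption, since the paper does not say which degree was used), and the normalization as a percentage of the number of nodes is inferred from the InDegree and nInDeg columns reported in the Results.

```python
def in_out_degree(arcs):
    """Count incoming and outgoing arcs per node in a directed network."""
    indeg, outdeg = {}, {}
    for src, dst in arcs:
        outdeg[src] = outdeg.get(src, 0) + 1
        indeg[dst] = indeg.get(dst, 0) + 1
        indeg.setdefault(src, 0)
        outdeg.setdefault(dst, 0)
    return indeg, outdeg

def core_numbers(arcs):
    """Core number of each node, ignoring arc direction (peeling method)."""
    neighbours = {}
    for a, b in arcs:
        if a != b:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    degree = {n: len(nb) for n, nb in neighbours.items()}
    core, k = {}, 0
    remaining = set(neighbours)
    while remaining:
        # Remove the node of lowest remaining degree; k never decreases.
        node = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for nb in neighbours[node]:
            if nb in remaining:
                degree[nb] -= 1
    return core

# Hypothetical toy network: A, B, C, D are densely linked, E is peripheral.
arcs = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("D", "B"), ("E", "A")]
indeg, outdeg = in_out_degree(arcs)
core = core_numbers(arcs)
print(indeg["A"], outdeg["A"])  # 2 2
print(core["A"], core["E"])     # 2 1

# Normalized degree as a percentage of the 1 000 nodes in the network.
n_in_degree = round(100 * 781 / 1000, 1)
print(n_in_degree)              # 78.1
```

The nodes with the highest core number form the central core; in the toy network these are A, B, C and D.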

Results

Descriptive analysis
Prior to the link analysis, we computed the frequency distribution of the 1 000 universities by country.

Country             Universities      %
United States                369   36.9
United Kingdom                68    6.8
Germany                       66    6.6
France                        50    5.0
Spain                         41    4.1
Canada                        39    3.9
Japan                         35    3.5
Italy                         34    3.4
Australia                     30    3.0
China                         17    1.7
Taiwan                        17    1.7
Sweden                        15    1.5
Brazil                        14    1.4
The Netherlands               13    1.3
Finland                       12    1.2
Rest of the World            180   18.0
TOTAL                      1 000  100.0
Table 1. Distribution of universities by country (first 15)


Table 1 shows the number of universities by country, listing only the first 15 countries. United States (US) universities account for 36.9% of the entire sample, followed by the United Kingdom (UK) (6.8%) and Germany (6.6%). This distribution is also observed in the Top 200 of the ranking, which suggests that there is a digital divide in favour of US universities. The low performance of emerging countries like Russia (0.6%) and India (0.4%) is also clear.

Geographical Map

Figure 1. Geographical map of the distribution of pages by country and their link flows

Figure 1 shows the geographical distribution of web pages by country and the incoming and outgoing links among these countries. Two regions stand out for their large number of web pages: North America (USA and Canada) and the European Union (EU) zone. The USA is the country with the most web pages (50.57%), holding half of the world's academic web pages indexed in Yahoo! Search. It is followed by Germany (7.14%) and the UK (4.28%) in the EU. Besides these zones, note the web development of Japan (2.35%), Australia (2.35%) and China (2.33%) in the East and Brazil (0.94%) in South America. By contrast, two zones have no universities in the sample: Africa (with the exception of South Africa) and the Middle East (with the exception of Israel and Saudi Arabia).
Seen from the US position, the upper loops show the outgoing links and the lower loops the incoming ones. The most important link flows are between North American and EU countries, while a second tier comprises the links between East Asian and Oceanian countries and the US.

Network Graph

The world-class network (Figure 2) shows small-world properties because its clustering coefficient (C=527.25) is considerably higher than that of a random network (C=35.14) (Watts & Strogatz, 1998). Furthermore, its average path length (l=2.26) is rather low. Visually, small-world properties can be seen in the transversal links that run across the network, connecting distant clusters (Figure 2). The in- and out-degree frequency distributions follow a power law trend (γin=0.81; γout=0.73), which allows us to state that this network has scale-free properties as well (Barabasi, Albert & Jeong, 2000).
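The two quantities behind the small-world test can be sketched in Python for an undirected toy graph. This is only an illustration under stated assumptions: the function names and the toy graph are hypothetical, arc direction is ignored, and the clustering coefficient computed here lies between 0 and 1, so it is not directly comparable with the figures reported above, which appear to be on a different scale.

```python
from collections import deque

def clustering_coefficient(adj):
    """Mean local clustering coefficient of an undirected graph."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        # Count the links that exist among the neighbours of this node.
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source node
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Hypothetical toy graph: a triangle A-B-C with a pendant node D on C.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(clustering_coefficient(adj))  # (1 + 1 + 1/3 + 0) / 4
print(average_path_length(adj))     # 16 / 12
```

A network is usually described as small-world when the first value is much higher than in a random network of the same size and density while the second stays low.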


Figure 2. Network graph of the world-class universities on the Web (N = 1 000; arcs ≥ 50 links)

Figure 2 shows the graph of the 1 000 higher education institutions. First, each university is linked with the universities of its own country. Thus, we can visually detect homogeneous national groups such as Germany (red), the UK (light green) or Japan (orange). However, we can also see that some countries do not form a compact group, such as France (dark blue), Canada (white) and countries with a small set of universities such as the Netherlands (dark red). This may be because some countries are included in other, larger national sub-networks; indeed, Canada is related to the US and the Netherlands to the UK. This describes a cumulative process in which each national sub-network is aggregated to another one, as in an accretion model.
The graph also shows linguistic (Thelwall, Tang & Price, 2003) and geographical relationships (Thelwall, 2002a). The European countries are located on the right side of the picture, while the left side is mainly taken up by Asian and American ones. It shows, for example, that Spanish universities lie between the European and the Latin-American ones, combining linguistic affinity with geographical proximity. In a similar way, Australia is located between the USA and the UK.
Size is related to link attraction, because the large universities are located in the core of the network. Nevertheless, some countries, specifically Asian ones (China, Japan and Taiwan), have large universities that are far from the core. This may be caused by the low development of English-language pages in these countries (Vaughan & Thelwall, 2004).


Figure 3. Detailed view of the central core of the network

The main core of the world network was detected with the k-cores method. The central core consists of 116 nodes with degree 93. This highly connected cluster contains 98 American universities; the rest are from Canada (7) and Europe (11). Figure 3 shows this central core in detail, highlighting universities like Harvard, Stanford or the Massachusetts Institute of Technology (MIT), which are located in the centre of the graph and attract a huge number of links from the entire network. In turn, the important European universities in the core of the network pull their national networks, as Cambridge does for the British network, Trier for the German one and the Swiss Federal Institute of Technology Zurich (ETHZ) for Switzerland. However, despite the closeness of the Australian universities (purple), there are no Asian, African or Latin-American universities in the core, with the exception of the Israeli ones, which are located around the United States sub-network.
We also calculated the in- and out-degree of each university and ranked the universities accordingly. United States universities are the most interconnected in the network. MIT (78.1) and the universities of Berkeley (73.5) and Stanford (73.1) are the most linked web domains in the network (Table 2). The universities that keep the network most connected through outgoing links are also from the US, particularly the universities of Wisconsin-Madison (47), Stanford (41.8) and Florida (41.2) (Table 3). Both tables include only US universities. The first European universities in the InDegree ranking are Cambridge (18th) and Leeds (19th); in the OutDegree ranking, they are ETHZ (15th) and the University of Amsterdam (22nd).

University                                  Domain          InDegree  nInDegree
Massachusetts Institute of Technology       mit.edu              781       78.1
University of California, Berkeley          berkeley.edu         735       73.5
Stanford University                         stanford.edu         731       73.1
University of Illinois at Urbana-Champaign  uiuc.edu             666       66.6
Harvard University                          harvard.edu          634       63.4
University of Michigan                      umich.edu            634       63.4
University of Wisconsin-Madison             wisc.edu             629       62.9
University of Texas at Austin               utexas.edu           589       58.9
Cornell University                          cornell.edu          557       55.7
University of Washington                    washington.edu       555       55.5

Table 2. First 10 universities by InDegree

University                                  Domain          OutDegree  nOutDegree
University of Wisconsin-Madison             wisc.edu              470        47.0
Stanford University                         stanford.edu          418        41.8
University of Florida                       ufl.edu               412        41.2
University of California, Berkeley          berkeley.edu          411        41.1
University of Washington                    washington.edu        390        39.0
Massachusetts Institute of Technology       mit.edu               378        37.8
University of Illinois at Urbana-Champaign  uiuc.edu              369        36.9
Carnegie Mellon University                  cmu.edu               365        36.5
University of Pennsylvania                  upenn.edu             360        36.0
Harvard University                          harvard.edu           356        35.6

Table 3. First 10 universities by OutDegree


As described above, the world network is the aggregated union of national sub-networks. The betweenness centrality index detects the gateway universities that connect these national sub-networks with the remaining ones. Table 4 shows the principal university in each country according to betweenness centrality. Outstanding universities can be seen in each country, such as MIT in the US, Cambridge in the UK or ETHZ in Switzerland. These universities thus connect local web spaces with international ones. However, there are no German or Spanish universities in the top positions, although both countries are well positioned in the network. We suggest that, since there is a linguistic factor in the relationships between countries, the German-speaking network is represented by ETHZ and the Spanish-speaking one by the Autonomous National University of Mexico (UNAM). Moreover, the betweenness index is rather close to the degree indicators, so we can state that these universities are the most important in their national or linguistic sub-network.

Country  University                                    Web domain     Betweenness  nBetweenness
US       Massachusetts Institute of Technology         mit.edu              65422          6.54
UK       University of Cambridge                       cam.ac.uk            20037          2.00
CH       Swiss Federal Institute of Technology Zurich  ethz.ch              18584          1.86
FR       Jussieu Campus                                jussieu.fr           13280          1.32
JP       University of Tokyo                           u-tokyo.ac.jp        12529          1.25
FI       University of Helsinki                        helsinki.fi           9489          0.95
MX       Autonomous National University of Mexico      unam.mx               7019          0.70
CA       University of British Columbia                ubc.ca                6813          0.68
TW       National Taiwan University                    ntu.edu.tw            6604          0.66
IT       University of Bologna                         unibo.it              6397          0.63

Table 4. First 10 countries by the betweenness of their principal university
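To make the gateway idea concrete, the following Python sketch computes raw betweenness for a directed, unweighted toy network with Brandes' algorithm. It is an illustration only: the paper's values were obtained with Pajek, the function name and the node labels are hypothetical, and the adjacency dictionary is assumed to list every node as a key, including nodes without outgoing arcs.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for betweenness in a directed, unweighted graph."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score

# Hypothetical toy network: two national groups (a1, a2 and b1, b2)
# that are connected only through the gateway node g.
adj = {"a1": ["g"], "a2": ["g"], "g": ["b1", "b2"], "b1": [], "b2": []}
scores = betweenness(adj)
print(scores["g"])   # 4.0: the four a -> b shortest paths all pass through g
print(scores["a1"])  # 0.0
```

In the toy network only the gateway node obtains a non-zero score, which is the pattern the betweenness ranking is used to detect among national sub-networks.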


Discussion

For some time now, the use of search engine data has been debated because of the instability of search engine results over short time periods (Bar-Ilan, 1998; Rousseau, 1997), the weakness of their search operators (Ingwersen, 1998) and the unreliability of their databases (Sullivan, 2003). However, recent studies have shown that current search engines have improved their consistency and reliability (Bar-Ilan, 2002; Bar-Ilan, 2004; Bar-Ilan, 2005a). Although their technical features have considerably improved, the coverage of their databases and the harvesting process remain key issues. Bar-Ilan (2005b) found that some search engines have serious problems indexing and retrieving non-Latin scripts such as Japanese, Chinese or Russian. Vaughan and Thelwall (2004) showed that there is a local bias in favour of US web sites and against East Asian web sites, which are underrepresented in the search engines. Our work may be affected by these biases, because the web domains of large East Asian universities are located far from the core of the graph (Figure 2) although they have a large number of web pages. The strong presence of US universities may be slightly affected by these coverage biases as well. Any interpretation of these results must take these biases into account.
The link flows and web page distribution in the geographical map (Figure 1) follow a pattern similar to that found for the European Union (Ortega & Aguillo, 2008b). Countries with many web pages attract and make more links than others, confirming the strong relationship between web pages and links (Thelwall & Harries, 2003; Katz & Cothey, 2006). The network graph also shows results similar to those of previous works. The world-class universities are grouped in local or national sub-networks which are connected with other sub-networks for linguistic or geographical reasons (Heimeriks & van den Besselaar, 2006; Ortega et al., 2008). These local or national sub-networks fit the community model of the Web suggested by Flake et al. (2000; 2002): several "gateway" universities act as hubs/authorities that connect the national communities or sub-networks with one another (Barabasi & Albert, 1999; Kleinberg, 1999). This reduces the distances between nodes and explains the emergence of small-world phenomena on the Web (Björneborn, 2003).

Conclusions

The world-class university network graph is composed of national sub-networks that merge in a central core, where the principal universities of each country pull their networks toward international link relationships. This network rests on the United States sub-network, which dominates the world network together with the aggregated European sub-networks, especially the British and German ones. This situation may be caused mainly by the technological development of these countries and by their production of international content, that is, English web pages. This second reason might explain the apparently lagging position of some East Asian countries.

References

Aguillo, I. F., Granadino, B., Ortega, J. L., & Prieto, J. A. (2006). Scientific Research Activity and Communication Measured With Cybermetrics Indicators. Journal of the American Society for Information Science and Technology, 57(10),1296-1302.

Aguillo, I. F., Ortega, J. L., & Fernandez, M. (2008). Webometrics Ranking of World Universities: Introduction, Methodology and Future Developments. Higher Education in Europe, 33(2-3)

Barabasi, A. L., Albert, R., & Jeong, H. (2000). Scale-Free Characteristics of Random Networks: the Topology of the World-Wide Web. Physica A, 281(1-4), 69-77.

Barabasi, A. L., & Albert, R. (1999). Emergence of Scaling in Random Networks. Science, 286(5439), 509-512.

Bar-Ilan, J. (1998). On the Overlap, the Precision and Estimated Recall of Search Engines, a Case Study of the Query "Erdos". Scientometrics, 42(2), 207-228.

Bar-Ilan, J. (2002). Methods for Measuring Search Engine Performance Over Time. Journal of the American Society for Information Science and Technology, 53(4), 308-319.

Bar-Ilan, J. (2004). The Use of Web Search Engines in Information Science Research. Annual Review of Information Science and Technology, 38, 231-288.

Bar-Ilan, J. (2005a). Expectations versus reality – Search engine features needed for Web research at mid 2005. Cybermetrics, 9(1), http://www.cindoc.csic.es/cybermetrics//articles/v9i1p2.html

Bar-Ilan, J. (2005b). Comparing Rankings of Search Results on the Web. Information Processing & Management, 41(6), 1511-1519.

Björneborn, L. (2003). Small-World Link Structures across an Academic Web Space: A Library and Information Science Approach. Copenhagen: Royal School of Library and Information Science. http://vip.db.dk/lb/phd/phd-thesis.pdf

Chen, C. (2003). Mapping Scientific Frontiers: The Quest for Knowledge Visualization. London: Springer-Verlag.

Cothey, V. (2005). Some preliminary results from a link-crawl of the European Union Research Area Web. In P. Ingwersen & B. Larsen (Eds.), Proceedings of the 10th International Conference of the International Society for Scientometrics and Informetrics. Stockholm: Karolinska University Press.

Faba-Perez, C., Zapico-Alonso, F., Guerrero-Bote, V. P., & De Moya-Anegon, F. (2005). Comparative Analysis of Webometric Measurements in Thematic Environments. Journal of the American Society for Information Science and Technology, 56(8), 779-785.

Heimeriks, G., & Van Den Besselaar, P. (2006). Analyzing hyperlinks networks: The meaning of hyperlink based indicators of knowledge production. Cybermetrics, 10(1,1). http://www.cindoc.csic.es/cybermetrics/articles/v10i1p1.html

Ingwersen, P. (1998). The Calculation of Web Impact Factors. Journal of Documentation, 54(2), 236-243.

Jenks, G. F. (1963). Generalization in statistical mapping. Annals of the Association of American Geographers, 53, 15-26.

Katz, J. S., & Cothey, V. (2006). Web indicators for complex innovation systems. Research Evaluation, 15(2), 85-95.

Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604-632.

Kretschmer, H., & Kretschmer, T. (2006). Application of a New Centrality Measure for Social Network Analysis to Bibliometric and Webometric Data. In Proceeding of the IEEE International Conference on Digital Information Management (ICDIM). Bangalore, India: IEEE

Nooy, W. de, Mrvar, A., & Batagelj, V. (2005). Exploratory Social Network Analysis with Pajek. Cambridge, UK: Cambridge University Press.

Ortega, J. L., & Aguillo, I. F. (2007). La Web académica española en el contexto del Espacio Europeo de Educación Superior: Estudio exploratorio. El profesional de la información, 16(5), 417-425.

Ortega, J. L., Aguillo, I. F., Cothey, V., & Scharnhorst, A. (2008). Maps of the academic web in the European Higher Education Area - an exploration of visual web indicators. Scientometrics, 74(2), 295-308.

Ortega, J. L. & Aguillo, I. F. (2008a). Visualization of the Nordic academic web: Link analysis using social network tools. Information Processing & Management, 44(4), 1624-1633.

Ortega, J. L. & Aguillo, I. F. (2008b). Linking patterns in the European Union's Countries: geographical maps of the European academic web space. Journal of Information Science (in press). http://internetlab.cindoc.csic.es/cv/11/Ortega_Aguillo_2008.pdf

Polanco, X., Boudourides, M., Besagni, D., & Roche, I. (2001). Clustering and Mapping European University Web Sites Sample for Displaying Associations and Visualizing Networks. In Proceeding of the NTTS&ETK 2001 Conference. Hersonissos, Crete

Rousseau, R. (1997). Sitations: an Exploratory Study. Cybermetrics, 1(1). http://www.cindoc.csic.es/cybermetrics/articles/v1i1p1.html

Seidman, S. B. (1983). Network structure and minimum degree. Social Networks, 5, 269–287.

Smith, A. G. (2008). Benchmarking Google Scholar with the New Zealand PBRF research assessment exercise. Scientometrics, 74(2), 309-316.

Sullivan, D. (2003). Google Dance Syndrome Strikes Again. SearchEngineWatch.Com. http://searchenginewatch.com/showPage.html?page=3114531.

Thelwall, M. (2002a). Evidence for the existence of geographic trends in university web site interlinking. Journal of Documentation, 58(5), 563-574.

Thelwall, M. (2002b). A research and institutional size based model for national university web site interlinking. Journal of Documentation, 58(6), 683-694.

Thelwall, M., & Aguillo, I. F. (2003). La salud de las Web universitarias españolas. Revista Española De Documentación Científica, 26(3),

Thelwall, M., & Harries, G. (2003). The Connection Between the Research of a University and Counts of Links to Its Web Pages: an Investigation Based Upon a Classification of the Relationships of Pages to the Research of the Host University. Journal of the American Society for Information Science and Technology, 54(7), 594-602.

Thelwall, M., & Harries, G. (2004). Do The Web Sites of Higher Rated Scholars Have Significantly More Online Impact? Journal of the American Society for Information Science and Technology, 55(2), 149-159.

Thelwall, M., Tang, R., & Price, L. (2003). Linguistic Patterns of Academic Web Use in Western Europe. Scientometrics, 56(3), 417-432.

Thelwall, M., & Zuccala, A. (2008). A University-Centred European Union Link Analysis. Scientometrics, 75(3), 407-420.

Vaughan, L. (2006). Visualizing linguistic and cultural differences using Web co-link data. Journal of the American Society for Information Science and Technology, 57(9), 1178-1193.

Vaughan, L., & Thelwall, M. (2004). Search engine coverage bias: evidence and possible causes. Information Processing & Management, 40(4), 693-707.

Vaughan, L. & Thelwall, M. (2005). A modeling approach to uncover hyperlink patterns: The case of Canadian universities. Information Processing & Management, 41(2), 347-359.

Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of 'small-world' networks. Nature, 393, 440-442.