Status
Since 2021‒02‒01, this is a legacy site. The underlying data
is updated once a year. The calculations are run until the
network is fully calculated.
Background
Founded by
me [
pkr1] in 1999, the
RePEc Author Service (RAS)
allows for authors to claim papers they have written in the
RePEc digital library. When two
authors claim the same paper, they are co-authors. The
relationship of co-authorship creates a network between most
authors. The shortest paths in that network between different
pairs of authors are of different lengths. Between co-authors,
the shortest path has length one; when two authors have each
written with a third author, but have not written a paper
together, the shortest path relating those two is of length two,
and so on.
There must be at least one author for whom the average length path
to all others in the network is the minimum. This author will have the highest
closeness ranking.
There must be at least one author who appears on the largest number
of shortest paths between any two other authors. This author will have the highest
betweenness ranking. A number
authors are marginal, i.e., they are not on a shortest path between any
other pair of authors. CollEc does not report a betweenness rank for them.
Calculations
This is a binary network. Two authors are linked if they have
claimed at least one paper in common in RAS. CollEc uses a
software called icanis written by
me [
pkr1] . Icanis only uses the giant
component of the graph. It calculates shortest paths from each
author to all others using an extension of
Dijkstra’s shortest path algorithm.
The extension allows for the calculation of all shortest paths
between two authors, rather than just one as the standard
algorithm would do. This is done for each starting author. The
advantage of this approach is that it’s easy to parallelize.
The incovenience is that essentially, the job is done
twice. Recall that the network is symmetric.
For any starting author, the results are
stored in a a file in a path depending on the authors RAS short
handle. Thus for me, the file is “k/r/1/paths”. Each line
in that file starts with the length of the path, then the full
path, bar the common start author, in that case me.
As is common for binary networks, we can find multiple paths
between two author. For some couples of authors, this leads to
large set of shortest paths. Strong shortest path multiplicity would
clog the user interface of CollEc. Fortunately, most of the
multiple binary shortest paths can be eliminated when we take
collaboration strength as a secondary criterion. We can build a
second graph between the nodes using a weighted network. In that
network, authors who have written more papers together are
closer. Specifically, for any paper that two authors have
written together, we augment the collaboration strength between
them by 1/(n-1) where n is the total number of
authors on that paper. We can evaluate all shortest paths for
this network and elimate most, but not all, multiplicities. Thus
in the case of pkr1, calculated on 2019‒04‒02, we still have, say
8 | ple338 | psm18 | psz8 | pwa118 | ppi113 | pch1577 | pgr621 | pvo256 |
8 | ple338 | psm18 | psz8 | pwa118 | ppi113 | pch1577 | pla844 | pvo256 |
in the “paths” file.
You may wonder why we are not using that weighted network to
begin with. Well, it turns out that within that weighted
network, shortest paths between two authors that have written
few papers together often pass via a third author that both of
them have prolifically collaborated with. Users would find
this rather odd!
The elimination of these shortest paths using the weighted
network does not impact closeness centrality rankings. It
does impact betweenness centrality. To calculate betweeness
centrality, icanis uses a special “inter” file. In that file, it
summarizes the presence of any author on the paths leading to
the starting author. Marginal authors don’t get a mention
here. An example line from my “k/r/1/inter”
20.5 pko253
The non-integer value is a feature. It appears because the
summarization detects correctly handles the residual path
multiplicity. If an
intermediate author only appears on n out of p
paths, the betweeness importance of the author only rises by
n/p.
It takes a mighty loong time to calculated all the paths. On the
other hand, RAS data is constant flux. Thus the CollEc data is
only assymptodically correct. This is the main problem of the
system. Rewriting the path calculations in C could go a long way
to reducing the gap, but it is unlikely to eliminate them.
Maybe a very expensive machine could do it but that’s a pipe dream
for RePEc. In the mean time, I deploy two server sponsored for RePEc
for the job.
Database use
Icanis uses a
mariaDB database for nodes.
Here is the table structure
- handle
- RAS handle
- name
- name as used in RAS
- homepage
- only if available
- node_tist
- time we updated the data about the node
- path_tist
- time we updated the pathss for this node
- closeness
- the closeness of the node
- closeness_rank
- the closeness ranking of the node
- path_summary_unique
- a string that summarises how many paths the node has of what length
- betweenness
- the value of betweenness
- betweenness_rank
- the rank of betweenness
For me, on 2019‒04‒05, that is
- handle
- pkr1
- name
- Thomas Krichel
- homepage
- http://openlib.org/home/krichel
- node_tist
- 1345890126
- path_tist
- 1554226314
- closeness
- 5.05979
- closeness_rank
- 10027
- path_summary_unique
- 1=7|2=116|3=1193|4=9893|5=20037|6=8613|7=1842|8=459|9=173|10=71|11=26|12=16|13=6|14=7|15=1
- betweenness
- 2.10804
- betweenness_rank
- 4845
The database is used to update the static pages once a day. It
is also used to find users in searches.
There are short of 100 million shortest path. There is
no need to stuff them into a database. When a user requests
the shortest path from an author A to B, icanis
looks at the paths file from A, and the paths files
for B. It picks the file with the most recent time
stemp. Assume the calculations for A are more
recent, then it just greps the A paths file for B.
Thus makes for very fast lookups.
Backup
The software is backed up internally. I export the paths
to a machine under the control of
Nikos Askitas
[
pas55].