CollEc documentation

Background

Founded by me [pkr1] in 1999, the RePEc Author Service (RAS) allows for authors to claim papers they have written in the RePEc digital library. When two authors claim the same paper, they are co-authors. The relationship of co-authorship creates a network between most authors. The shortest paths in that network between different pairs of authors are of different lengths. Between co-authors, the shortest path has length one; when two authors have each written with a third author, but have not written a paper together, the shortest path relating those two is of length two, and so on.
There must be at least one author for whom the average length path to all others in the network is the minimum. This author will have the highest closeness ranking.
There must be at least one author who appears on the largest number of shortest paths between any two other authors. This author will have the highest betweenness ranking. A number authors are marginal, i.e., they are not on a shortest path between any other pair of authors. CollEc does not report a betweenness rank for them.

Calculations

This is a binary network. Two authors are linked if they have claimed at least one paper in common in RAS. CollEc uses a software called icanis written by me [pkr1] . Icanis only uses the giant component of the graph. It calculates shortest paths from each author to all others using an extension of Dijkstra’s shortest path algorithm. The extension allows for the calculation of all shortest paths between two authors, rather than just one as the standard algorithm would do. This is done for each starting author. The advantage of this approach is that it’s easy to parallelize. The incovenience is that essentially, the job is done twice. Recall that the network is symmetric.
For any starting author, the results are stored in a a file in a path depending on the authors RAS short handle. Thus for me, the file is “k/r/1/paths”. Each line in that file starts with the length of the path, then the full path, bar the common start author, in that case me.
As is common for binary networks, we can find multiple paths between two author. For some couples of authors, this leads to large set of shortest paths. Strong shortest path multiplicity would clog the user interface of CollEc. Fortunately, most of the multiple binary shortest paths can be eliminated when we take collaboration strength as a secondary criterion. We can build a second graph between the nodes using a weighted network. In that network, authors who have written more papers together are closer. Specifically, for any paper that two authors have written together, we augment the collaboration strength between them by 1/(n-1) where n is the total number of authors on that paper. We can evaluate all shortest paths for this network and elimate most, but not all, multiplicities. Thus in the case of pkr1, calculated on 2019‒04‒02, we still have, say
8ple338psm18psz8pwa118ppi113pch1577pgr621pvo256
8ple338psm18psz8pwa118ppi113pch1577pla844pvo256
in the “paths” file.
You may wonder why we are not using that weighted network to begin with. Well, it turns out that within that weighted network, shortest paths between two authors that have written few papers together often pass via a third author that both of them have prolifically collaborated with. Users would find this rather odd!
The elimination of these shortest paths using the weighted network does not impact closeness centrality rankings. It does impact betweenness centrality. To calculate betweeness centrality, icanis uses a special “inter” file. In that file, it summarizes the presence of any author on the paths leading to the starting author. Marginal authors don’t get a mention here. An example line from my “k/r/1/inter”
20.5    pko253
The non-integer value is a feature. It appears because the summarization detects correctly handles the residual path multiplicity. If an intermediate author only appears on n out of p paths, the betweeness importance of the author only rises by n/p.
It takes a mighty loong time to calculated all the paths. On the other hand, RAS data is constant flux. Thus the CollEc data is only assymptodically correct. This is the main problem of the system. Rewriting the path calculations in C could go a long way to reducing the gap, but it is unlikely to eliminate them. Maybe a very expensive machine could do it but that’s a pipe dream for RePEc. In the mean time, I deploy two server sponsored for RePEc for the job.

Database use

Icanis uses a mariaDB database for nodes. Here is the table structure
handle
RAS handle
name
name as used in RAS
homepage
only if available
node_tist
time we updated the data about the node
path_tist
time we updated the pathss for this node
closeness
the closeness of the node
closeness_rank
the closeness ranking of the node
path_summary_unique
a string that summarises how many paths the node has of what length
betweenness
the value of betweenness
betweenness_rank
the rank of betweenness
For me, on 2019‒04‒05, that is
handle
pkr1
name
Thomas Krichel
homepage
http://openlib.org/home/krichel
node_tist
1345890126
path_tist
1554226314
closeness
5.05979
closeness_rank
10027
path_summary_unique
1=7|2=116|3=1193|4=9893|5=20037|6=8613|7=1842|8=459|9=173|10=71|11=26|12=16|13=6|14=7|15=1
betweenness
2.10804
betweenness_rank
4845
The database is used to update the static pages once a day. It is also used to find users in searches.
There are short of 100 million shortest path. There is no need to stuff them into a database. When a user requests the shortest path from an author A to B, icanis looks at the paths file from A, and the paths files for B. It picks the file with the most recent time stemp. Assume the calculations for A are more recent, then it just greps the A paths file for B. Thus makes for very fast lookups.

Backup

The software is backed up internally. I export the paths to a machine under the control of Nikos Askitas [pas55].