Sloandora: A metric and UI for Integer Sequence exploration
This page describes a software project I developed in the fall of 2009 with the help of suggestions from the seqfan mailing list.
Contents
Introduction
The overall goal is to provide a software tool for computerassisted discovery of integer sequences from OEIS [5] that are related to a user's area of research. It is understood that the user can look at a sequence and tell the computer whether or not it is "interesting" (relevant to his area of research) or not. It is also understood that the computer can determine (through the use of automatic techniques possibly supplemented by input from other users or database "curators") which sequences are "likely" to be interesting given the user's indicated preferences so far.
An analogy for this type of service already exists in:
 The Pandora [3] music discovery service, which helps listeners find music they might like based on cognitive attributes
 Lydia and related systems for news analysis [2]
Overview of Design
The overall structure of my implementation is the following:
 A database of items (sequences in our case)
 A metric algorithm for measuring the cognitive distance between any two items
 A user interface for:
1. creating a topic (corresponding to a particular area of interest, or a field of research)
2. autorecommending items to the user
3. Userrating items as "relevant", "not relevant", or no comment (ambivalent)
The implementation given here (source code below) uses a perl commandline interface and a C compiled tool that implements the metric. It uses a local copy of the OEIS database in the form of eisBTfry00000.txt files (which are no longer available except through special request).
Comparison to Other Systems
There are a number of other tools for sequence exploration, which fulfill different objectives:
 The OEIS "webcam" available here allows users to see recent additions.
 The Superseeker for "mystery" sequences that are not currently in OEIS.
 Simon Plouffe's GFUN package (also primarily for mystery sequences).
 The seqfan list to ask questions to other interested people.
None of these were an adequate solution for my objectives.
The Sloandora Metric
The "metric" is the algorithm for determining how closely related two sequences are. It is central to any assisted browsing system.
The Pandora music discovery service [3] uses a database (the Music Genome Project) that was built up with the aid of real people, who listened to music and created a "genome" for each song. Each "genome" is essentially a set of cognitive attribute tags drawn from hundreds (or possibly thousands) of different musical attributes. Using humans is necessary partly because music has to be listened to (and somehow interpreted) to discover its cognitive content, and computers aren't too good at that yet.[4]
In our case the job is made easier by the fact that the OEIS entries are essentially just plain text, and the text itself is about the right size to use as the genome.
So I have chosen a characterlevel concordance metric as the initial algorithm for measuring similarity of sequences. My specific implementation works as follows:
 Go through the first source text, compiling a list of all 5byte substrings that appear and their frequencies. Go through the second source text and compile the same statistics for it. Thus for each string we have a count for A and another count for B.
 For each substring we then redistribute the counts into an "AB count". AB equals the lesser of A and B, and A and B are then both decreased by this amount. For example, if "facto" appears three times in source text A and two times in source text B, then the initial scan produces A{"facto"}=3 and B{"facto"}=2; this is then adjusted to produce A{"facto"}=1, AB{"facto"}=2 and B{"facto"}=0.
 Then we compute the coverage of A over B, which measures how much source A "covers" source B. This is simply the total of all AB counts divided by the sum of AB's and B's. So for example, if source B contains 1004 bytes (and thus 1000 5character strings), and if 700 of these 5character strings occur in A, then A's coverage of B is 0.7 (or 70%). B's coverage of A is computed similarly.
 The concordance between A and B is the lesser of the two coverage values.
Topic State Variables
The user's browsing is centered around "topics", which represent distinct areas of interest.
Each topic has a datafile that contains its current state. This state consists of:
 A list of every sequence in the OEIS, with a score telling how well that sequence fits this topic
 A list of every sequence that has been recommended to the user so far
 A flag for each recommendation telling whether it was given a positive or negative rating (or neither).
User Interface
The user begins by creating a "new topic" representing an area of research. To create a topic the user must select a single relevant sequence. This is the "seed sequence".
The program computes the concordance match between the seed sequence and all other sequences, and stores this information in a file representing the current state of that topic.
Then the program selects the sequence that has the highest concordance with the seed sequence, and presents it to the user. (In MacOS, I can open the sequence's web page into a broswer).
The user can then perform the following actions:
"" or Next: give me another automatic recommendation
Y or Relevant: this sequence belongs to this topic, use it to improve your recommendations in the future.
N or Irrelevant: this sequence does not belong to this topic, avoid sequences like it in the future.
A123456: Display a sequence on request, so I can decide if it is relevant and rate it accordingly.
When the user selects Next, another sequence is recommended and no other actions are taken.
When the user selects Relevant or Irrelevant, the software performs the concordance metric between the new sequence and all other sequences in the database. The resulting concordance score is either added to or subtracted from the current scores for each sequence, and the topic state is updated.
The action of requesting a specific sequence does not affect the topic state, unless the user subsequently rates it with the Relevant or Irrelevant command.
Results
Initial testing of the metric has been performed. I used the metric described above with a 5character token size to find the closest sequence in the OEIS to a given seed sequence:

Example Sessions
Here is an example session in which the user wants to study Egyptian fractions:
bash$ ./Sloandora _ _ (_`  _ ._ ._ _ _ ._ ._ , )  ( ) ,   (  ( )  , ~ ~ ~ `~` ~ ~ ~` ~ ~ `~` UserAided Integer Sequence AutoDiscovery System by Robert Munafo, with the seqfan list members Loading database index... Select a topic: egyptian There is no topic 'egyptian'. Do you want to create one? yes Enter the Anumber of a sequence to start this topic: A100140 Rating 'A100140' at 1; 2 threads: ...................................................... 95.27 CPUseconds. Next recommendation: A117116 (score 0.179) Denominators of an Egyptian Fraction for phi = (1+sqrt(5))/2. Command loop (type 'help' for help) Topic 'egyptian' > relevant Rating 'A117116' at 1; 2 threads: ........................................... 77.28 CPUseconds. Next recommendation: A006525 (score 0.420) Denominator of Egyptian fraction for e2. Topic 'egyptian' > show Current sequence is A006525; its score is 0.420 % A006525 %I M1553 %S 2,5,55,9999,3620211523,25838201785967533906, %T 3408847366605453091140558218322023440765 %N Denominator of Egyptian fraction for e2. %H <a href="http://www.research.att.com/~njas/sequences/Sindx_Ed.html#Egypt">Index entries for sequences related to Egyptian fractions</a> %H Eric Weisstein's World of Mathematics, <a href="http://mathworld.wolfram.com/EgyptianFraction.html">Egyptian Fraction</a> %Y Adjacent sequences: A006522 A006523 A006524 this_sequence A006526 A006527 A006528 %Y Sequence in context: A114029 A013171 A073422 this_sequence A042161 A101151 A030244 %K nonn %O 0,1 %A njas %E More terms from H. P. Robinson. Topic 'egyptian' > y Rating 'A006525' at 1; 2 threads: ...................................... 68.64 CPUseconds. Next recommendation: A006524 (score 0.793) Egyptian fraction for 1/ Pi. Topic 'egyptian' > A01234 Loading A001234 (current score: 0.341) Stirling numbers of first kind. Topic 'egyptian' > ? Sloandora commands: Show Display the entire record for the current sequence Web Display the current sequence in your web browser Y, Relevant The currently displayed sequence matches this topic N, Irrelevant The currently displayed sequence does NOT match Next, Skip Show me another sequence you think is relevant A012345 Show me sequence A012345 Best Reselect the latest recommendation Ratings List all sequences that have been rated in this topic Save Save the state of the current topic. Quit, Exit, Bye Exit Sloandora (also does "save") Help, ? Displays this help Topic 'egyptian' > best Current recommendation: A006524 (score 0.793) Egyptian fraction for 1/ Pi. Topic 'egyptian' > y Rating 'A006524' at 1; 2 threads: .......................................... 70.63 CPUseconds. Next recommendation: A006487 (score 1.143) Egyptian fraction for square root of 2. Topic 'egyptian' > q Saving topic 'egyptian'...done
Here is another example session in which a few sequences had to be rated Irrelevant. This session was run on a system with 16 hardware threads; the user is interested in anything related to the Mandelbrot set:
bash$ ./Sloandora _ _ (_`  _ ._ ._ _ _ ._ ._ , )  ( ) ,   (  ( )  , ~ ~ ~ `~` ~ ~ ~` ~ ~ `~` UserAided Integer Sequence AutoDiscovery System by Robert Munafo, with the seqfan list members Loading database index... Select a topic: mandelbrot There is no topic 'mandelbrot'. Do you want to create one? yes Enter the Anumber of a sequence to start this topic: 52154 Rating 'A052154' at 1; 16 threads: ...... 56.91 CPUseconds. Next recommendation: A098403 (score 0.203) Decimal expansion of area of Mandelbrot set. Command loop (type 'help' for help) Topic 'mandelbrot' > show Current sequence is A098403; its score is 0.203 % A098403 %S 1,5,0,6,5,9,1 %N Decimal expansion of area of Mandelbrot set. %D Nigel LesmoirGodon, Will Rood and Ralph Edney, 'Introducing Fractal Geometry,' Icon Books UK & Totem Books USA, 2000, page 97. %H Kerry Mitchell, <a href="http://www.fractalus.com/kerry/articles/area/mandelbrotarea.html">A Statistical Investigation of the Area of the Mandelbrot Set</a> %H Robert P. Munafo, <a href="http://www.mrob.com/pub/muency/areaofthemandelbrotset.html">Area of the Mandelbrot Set</a>. %H Eric W. Weisstein, <a href="http://mathworld.wolfram.com/MandelbrotSet.html">Link to a section of The World of Mathematics.</a>. %H Eric W. Weisstein, <a href="http://icl.pku.edu.cn/yujs/MathWorld/math/m/m049.htm">Mandelbrot Set</a>. %H Eric W. Weisstein, <a href="http://www.ics.uci.edu/~eppstein/junkyard/mandarea.html">Mandelbrot Area</a>. %e Area = 1.50659177 +/ 0.00000008. %Y Adjacent sequences: A098400 A098401 A098402 this_sequence A098404 A098405 A098406 %Y Sequence in context: A130415 A035550 A096287 this_sequence A069206 A091685 A062824 %K cons,more,nonn %O 1,2 %A Robert G. Wilson v (rgwv(AT)rgwv.com), Sep 06 2004 Topic 'mandelbrot' > y Rating 'A098403' at 1; 16 threads: ...... 52.62 CPUseconds. Next recommendation: A054671 (score 0.398) Denominators of (reduced) coefficients of Laurent series for conformal mapping from exterior of unit disk onto exterior of Mandelbrot set. Topic 'mandelbrot' > y Rating 'A054671' at 1; 16 threads: ....... 56.96 CPUseconds. Next recommendation: A054670 (score 0.751) Numerators of (reduced) coefficients of Laurent series for conformal mapping from exterior of unit disk onto exterior of Mandelbrot set. Topic 'mandelbrot' > y Rating 'A054670' at 1; 16 threads: ....... 58.57 CPUseconds. Next recommendation: A055544 (score 0.781) Total number of nodes in all rooted trees with n nodes. Topic 'mandelbrot' > n Rating 'A055544' at 1; 16 threads: ...... 50.87 CPUseconds. Next recommendation: A090037 (score 0.547) Number of tetramagic (4multimagic) series for squares of order n. Topic 'mandelbrot' > n Rating 'A090037' at 1; 16 threads: ...... 50.52 CPUseconds. Next recommendation: A102525 (score 0.415) Decimal expansion of log(2)/log(3). Topic 'mandelbrot' > n Rating 'A102525' at 1; 16 threads: ...... 57.35 CPUseconds. Next recommendation: A006874 (score 0.281) Muatoms of period n on continent of Mandelbrot set. Topic 'mandelbrot' > y Rating 'A006874' at 1; 16 threads: ....... 60.26 CPUseconds. Next recommendation: A006875 (score 0.593) Nonseed muatoms of period n in Mandelbrot set. Topic 'mandelbrot' > y Rating 'A006875' at 1; 16 threads: ...... 49.89 CPUseconds. Next recommendation: A006876 (score 0.932) Mumolecules in Mandelbrot set whose seeds have period n. Topic 'mandelbrot' > show Current sequence is A006876; its score is 0.932 % A006876 %I M2883 %S 1,0,1,3,11,20,57,108,240,472,1013,1959,4083,8052,16315,32496, %T 65519,130464,262125,523209,1048353,2095084,4194281,8384100,16777120 %N Mumolecules in Mandelbrot set whose seeds have period n. %D B. B. Mandelbrot, The Fractal Geometry of Nature, Freeman, NY, 1982, p. 183. %D R. Penrose, The Emperor's New Mind, Penguin Books, NY, 1991, p. 138. %H R. P. Munafo, <a href="http://www.mrob.com/pub/muency.html">MuEncy  The Encyclopedia of the Mandelbrot Set</a> %Y Cf. A006874, A006875. %Y Adjacent sequences: A006873 A006874 A006875 this_sequence A006877 A006878 A006879 %Y Sequence in context: A075226 A028978 A082628 this_sequence A031239 A088619 A031318 %K nonn %O 1,4 %A mrob(AT)mrob.com (Robert P Munafo) Topic 'mandelbrot' > ratings Rated sequences in topic 'mandelbrot': Uprated: A006874 A054670 A052154 A054671 A098403 A006875 Downrated: A102525 A055544 A090037 Topic 'mandelbrot' > q Saving topic 'mandelbrot'...done
Performance Benchmarks
Every time a sequence is rated, the concordance metric must be performed between it and all other sequences in the database. The processing time needed (measured in "CPUseconds", number of logical CPUs times elapsed CPU time) is proportional to the size of the entire database times the log (base 2) of the size of the sequence being rated. The number of keyword comparisons per second is roughly the product of the database size and the article size. Using a 2007 version of the OEIS containing 131553 sequences (110 megabytes), a 2.33GHz Core 2 Duo laptop takes about 2025 seconds to rate a sequence, and an 8core, twosocket Xeon "Nehalem" workstation takes less than 4 seconds. The current algorithm is reasonably efficient (see the source) but could still be optimized further.
For significantly larger databases one would need to abandon the concordance metric for something a little less accurate. A number of algorithms have been developed for similar applications, using Markov chains or similar functions. Spam filters typically model character sequences as predictors for subsequent characters; and the Google PageRank system performs a similar function using webpage links as predictors for the relevance of linked pages. In Sloandora, the charactersequence analysis could be used as part of a more efficient concordance metric. The PageRank approach could be applied to existing links in the OEIS database (including when one sequence links explicitly to another, and when two sequences both link to something else like a journal article).
Future Improvements
In addition to performance improvements just discussed, I imagine doing some of the following soon:
It is clear from my results so far that specific types of text (like formulas, sequence numbers, journal titles, email addresses and names of people) should be identified and tagged in different ways. The current implementation uses a generic concordance metric, which is completely "agnostic" to the purpose of the different OEIS data fields. Fieldtagging is done in other text concordance applications (for example see [1] e.g. "Candidate Answer Pruning" on page 7).
A lot of this can be done fairly easily, for example, by giving certain fields (like %Y and %E lines) less importance and others (like %S and %T) more. This could be done at the user's recommendation, or by default weights that the system starts with for any newly created topic. The lowlevel implementation of this is fairly easy within the current concordance search algorithm.
It is relatively easy to "undo" a Relevant or Irrelevant rating, so I'll probably implement that next.
There should be options to display more than one recommendation at a time, and give the user a choice of which one to look at and/or rate. (Only a tiny additional amount of computation would be required.)
When new sequences are added to the OEIS database, the topic needs to be updated. If the updated sequence was not part of the topic definition (i.e. was never given a rating) then all we need to do is update that sequence's score (the sum of its concordance against all Yrated items minus its concordance against all Nrated items). If the updated sequence is one of the defining sequences for the topic, then the process takes longer: First "undo" the sequence's rating; then "rerate" the sequence using the new text.
With greater effort, the search algorithm could be improved by the use of a contentaddressable index. There is no obvious way to retain the exact metric described above, but the contentaddressable index could be used to model the OEIS database as a set of webpages that are linked if they share substrings, and algorithms designed for personalized webpage ranking (e.g. [6]) could provide the autorecommendation.
Current Source Code
Sloandora runs on UNIX, Linux, or MacOS X systems with Perl and a C compiler.
The C source for the concordance metric is here.
The controlling Perl script is here.
You will also need the eisBTfry00000.txt, eisBTfry00001.txt, etc. files and I can't help you with that, but you can ask the seqfan list (or perhaps email N.J.A. Sloane directly if he already knows you).
A README file explaining how to put the pieces together is here
If you need help setting this up, contact me by email ("email me" link at the bottom of this page).
[1] J.H. Kil, L. Lloyd, and S. Skiena, Question Answering with Lydia, TREC2005, Proceedings of the 2005 Text REtrieval Conference, http://trec.nist.gov/pubs/trec14/t14_proceedings.html
[3] Pandora® Internet Radio, (interactive music system), http://www.pandora.com/, 20072010.
[4] Rob Walker, The song decoders, New York Times Magazine, 2009 Oct 14. http://www.nytimes.com/2009/10/18/magazine/18Pandorat.html
[5] The OEIS Foundation, Inc. (http://www.oeisf.org/) has stewardship of The OnLine Encyclopedia of Integer Sequences (OEIS). The legacy (or "Classic") server is at http://oeis.org/classic, and the Wiki is at http://www.oeis.org/wiki.
[6] Christian Borgs et al., A Sublinear Time Algorithm for PageRank Computations, 2012. At arxiv:1202.2771
The OnLine Encyclopedia of Integer Sequences ("OEIS") is intellectual property of The OEIS Foundation Inc.
Pandora is a registered trademark of Pandora Media, Inc.
s.13