Sloandora: A metric and UI for Integer Sequence exploration  

This page describes a software project I developed in the fall of 2009 with the help of suggestions from the seqfan mailing list.

Contents

Introduction

Overview of Design

Comparison to Other Systems

The Metric

Topic State Variables

User Interface

Results

Example Sessions

    Egyptian Fractions Example

    "Mandelbrot" Topic Example

Performance Benchmarks

Future Improvements

Current Source Code

References


Introduction

The overall goal is to provide a software tool for computer-assisted discovery of integer sequences from OEIS [5] that are related to a user's area of research. It is understood that the user can look at a sequence and tell the computer whether or not it is "interesting" (relevant to his area of research) or not. It is also understood that the computer can determine (through the use of automatic techniques possibly supplemented by input from other users or database "curators") which sequences are "likely" to be interesting given the user's indicated preferences so far.

An analogy for this type of service already exists in:

Overview of Design

The overall structure of my implementation is the following:

1. creating a topic (corresponding to a particular area of interest, or a field of research)

2. auto-recommending items to the user

3. User-rating items as "relevant", "not relevant", or no comment (ambivalent)

The implementation given here (source code below) uses a perl command-line interface and a C compiled tool that implements the metric. It uses a local copy of the OEIS database in the form of eisBTfry00000.txt files (which are no longer available except through special request).

Comparison to Other Systems

There are a number of other tools for sequence exploration, which fulfill different objectives:

None of these were an adequate solution for my objectives.

The Sloandora Metric

The "metric" is the algorithm for determining how closely related two sequences are. It is central to any assisted browsing system.

The Pandora music discovery service [3] uses a database (the Music Genome Project) that was built up with the aid of real people, who listened to music and created a "genome" for each song. Each "genome" is essentially a set of cognitive attribute tags drawn from hundreds (or possibly thousands) of different musical attributes. Using humans is necessary partly because music has to be listened to (and somehow interpreted) to discover its cognitive content, and computers aren't too good at that yet.[4]

In our case the job is made easier by the fact that the OEIS entries are essentially just plain text, and the text itself is about the right size to use as the genome.

So I have chosen a character-level concordance metric as the initial algorithm for measuring similarity of sequences. My specific implementation works as follows:

Topic State Variables

The user's browsing is centered around "topics", which represent distinct areas of interest.

Each topic has a datafile that contains its current state. This state consists of:

User Interface

The user begins by creating a "new topic" representing an area of research. To create a topic the user must select a single relevant sequence. This is the "seed sequence".

The program computes the concordance match between the seed sequence and all other sequences, and stores this information in a file representing the current state of that topic.

Then the program selects the sequence that has the highest concordance with the seed sequence, and presents it to the user. (In MacOS, I can open the sequence's web page into a broswer).

The user can then perform the following actions:

"--" or Next: give me another automatic recommendation

Y or Relevant: this sequence belongs to this topic, use it to improve your recommendations in the future.

N or Irrelevant: this sequence does not belong to this topic, avoid sequences like it in the future.

A123456: Display a sequence on request, so I can decide if it is relevant and rate it accordingly.

When the user selects Next, another sequence is recommended and no other actions are taken.

When the user selects Relevant or Irrelevant, the software performs the concordance metric between the new sequence and all other sequences in the database. The resulting concordance score is either added to or subtracted from the current scores for each sequence, and the topic state is updated.

The action of requesting a specific sequence does not affect the topic state, unless the user subsequently rates it with the Relevant or Irrelevant command.

Results

Initial testing of the metric has been performed. I used the metric described above with a 5-character token size to find the closest sequence in the OEIS to a given seed sequence:

Seed Match Score Notes
A000213 (Tribonacci numbers) A001648 0.270 Tetranacci numbers without the leading 4
A000040 (The Prime numbers) A019590 0.225 Fermat's Last Theorem
A000079 (Powers of 2) A005408 0.284 The odd numbers
A005577 (arrangements of pennies in rows) A005576 0.447 A different pennies sequence
A002808 (Composite numbers) A018252 0.357 The nonprime numbers
A006451 (Solution to a diophantine equation) A006454 0.324 Both share a reference to the article "How four dogs meet in a field".
A006887 (Kaprekar triples) A060768 0.268 Pseudo-Kaprekar triples
A100140 (Greedy Egyptian fractions) A117116 0.181 Denominators of an Egyptian Fraction for phi = (1+sqrt(5))/2

Example Sessions

Egyptian Fractions Example

Here is an example session in which the user wants to study Egyptian fractions:

bash$ ./Sloandora _ _ (_` | _ ._ ._ _| _ ._ ._ , ) | ( ) ,-| | | ( | ( ) | ,-| ~ ~ ~ `~` ~ ~ ~` ~ ~ `~`   User-Aided Integer Sequence Auto-Discovery System by Robert Munafo, with the seqfan list members   Loading database index... Select a topic: egyptian There is no topic 'egyptian'. Do you want to create one? yes Enter the A-number of a sequence to start this topic: A100140 Rating 'A100140' at 1; 2 threads: ...................................................... 95.27 CPU-seconds. Next recommendation: A117116 (score 0.179) Denominators of an Egyptian Fraction for phi = (1+sqrt(5))/2.   Command loop (type 'help' for help) Topic 'egyptian' > relevant Rating 'A117116' at 1; 2 threads: ........................................... 77.28 CPU-seconds. Next recommendation: A006525 (score 0.420) Denominator of Egyptian fraction for e-2. Topic 'egyptian' > show Current sequence is A006525; its score is 0.420 %---- A006525 %I M1553 %S 2,5,55,9999,3620211523,25838201785967533906, %T 3408847366605453091140558218322023440765 %N Denominator of Egyptian fraction for e-2. %H <a href="http://www.research.att.com/~njas/sequences/Sindx_Ed.html#Egypt">Index entries for sequences related to Egyptian fractions</a> %H Eric Weisstein's World of Mathematics, <a href="http://mathworld.wolfram.com/EgyptianFraction.html">Egyptian Fraction</a> %Y Adjacent sequences: A006522 A006523 A006524 this_sequence A006526 A006527 A006528 %Y Sequence in context: A114029 A013171 A073422 this_sequence A042161 A101151 A030244 %K nonn %O 0,1 %A njas %E More terms from H. P. Robinson.   Topic 'egyptian' > y Rating 'A006525' at 1; 2 threads: ...................................... 68.64 CPU-seconds. Next recommendation: A006524 (score 0.793) Egyptian fraction for 1/ Pi. Topic 'egyptian' > A01234 Loading A001234 (current score: 0.341) Stirling numbers of first kind. Topic 'egyptian' > ? Sloandora commands:   Show Display the entire record for the current sequence Web Display the current sequence in your web browser Y, Relevant The currently displayed sequence matches this topic N, Irrelevant The currently displayed sequence does NOT match Next, Skip Show me another sequence you think is relevant A012345 Show me sequence A012345 Best Re-select the latest recommendation Ratings List all sequences that have been rated in this topic Save Save the state of the current topic. Quit, Exit, Bye Exit Sloandora (also does "save") Help, ? Displays this help   Topic 'egyptian' > best Current recommendation: A006524 (score 0.793) Egyptian fraction for 1/ Pi. Topic 'egyptian' > y Rating 'A006524' at 1; 2 threads: .......................................... 70.63 CPU-seconds. Next recommendation: A006487 (score 1.143) Egyptian fraction for square root of 2. Topic 'egyptian' > q Saving topic 'egyptian'...done

"Mandelbrot" Topic Example

Here is another example session in which a few sequences had to be rated Irrelevant. This session was run on a system with 16 hardware threads; the user is interested in anything related to the Mandelbrot set:

bash$ ./Sloandora _ _ (_` | _ ._ ._ _| _ ._ ._ , ) | ( ) ,-| | | ( | ( ) | ,-| ~ ~ ~ `~` ~ ~ ~` ~ ~ `~`   User-Aided Integer Sequence Auto-Discovery System by Robert Munafo, with the seqfan list members   Loading database index... Select a topic: mandelbrot There is no topic 'mandelbrot'. Do you want to create one? yes Enter the A-number of a sequence to start this topic: 52154 Rating 'A052154' at 1; 16 threads: ...... 56.91 CPU-seconds. Next recommendation: A098403 (score 0.203) Decimal expansion of area of Mandelbrot set.   Command loop (type 'help' for help) Topic 'mandelbrot' > show Current sequence is A098403; its score is 0.203 %---- A098403   %S 1,5,0,6,5,9,1 %N Decimal expansion of area of Mandelbrot set. %D Nigel Lesmoir-Godon, Will Rood and Ralph Edney, 'Introducing Fractal Geometry,' Icon Books UK & Totem Books USA, 2000, page 97. %H Kerry Mitchell, <a href="http://www.fractalus.com/kerry/articles/area/mandelbrot-area.html">A Statistical Investigation of the Area of the Mandelbrot Set</a> %H Robert P. Munafo, <a href="http://www.mrob.com/pub/muency/areaofthemandelbrotset.html">Area of the Mandelbrot Set</a>. %H Eric W. Weisstein, <a href="http://mathworld.wolfram.com/MandelbrotSet.html">Link to a section of The World of Mathematics.</a>. %H Eric W. Weisstein, <a href="http://icl.pku.edu.cn/yujs/MathWorld/math/m/m049.htm">Mandelbrot Set</a>. %H Eric W. Weisstein, <a href="http://www.ics.uci.edu/~eppstein/junkyard/mand-area.html">Mandelbrot Area</a>. %e Area = 1.50659177 +/- 0.00000008. %Y Adjacent sequences: A098400 A098401 A098402 this_sequence A098404 A098405 A098406 %Y Sequence in context: A130415 A035550 A096287 this_sequence A069206 A091685 A062824 %K cons,more,nonn %O 1,2 %A Robert G. Wilson v (rgwv(AT)rgwv.com), Sep 06 2004   Topic 'mandelbrot' > y Rating 'A098403' at 1; 16 threads: ...... 52.62 CPU-seconds. Next recommendation: A054671 (score 0.398) Denominators of (reduced) coefficients of Laurent series for conformal mapping from exterior of unit disk onto exterior of Mandelbrot set. Topic 'mandelbrot' > y Rating 'A054671' at 1; 16 threads: ....... 56.96 CPU-seconds. Next recommendation: A054670 (score 0.751) Numerators of (reduced) coefficients of Laurent series for conformal mapping from exterior of unit disk onto exterior of Mandelbrot set. Topic 'mandelbrot' > y Rating 'A054670' at 1; 16 threads: ....... 58.57 CPU-seconds. Next recommendation: A055544 (score 0.781) Total number of nodes in all rooted trees with n nodes. Topic 'mandelbrot' > n Rating 'A055544' at -1; 16 threads: ...... 50.87 CPU-seconds. Next recommendation: A090037 (score 0.547) Number of tetramagic (4-multimagic) series for squares of order n. Topic 'mandelbrot' > n Rating 'A090037' at -1; 16 threads: ...... 50.52 CPU-seconds. Next recommendation: A102525 (score 0.415) Decimal expansion of log(2)/log(3). Topic 'mandelbrot' > n Rating 'A102525' at -1; 16 threads: ...... 57.35 CPU-seconds. Next recommendation: A006874 (score 0.281) Mu-atoms of period n on continent of Mandelbrot set. Topic 'mandelbrot' > y Rating 'A006874' at 1; 16 threads: ....... 60.26 CPU-seconds. Next recommendation: A006875 (score 0.593) Non-seed mu-atoms of period n in Mandelbrot set. Topic 'mandelbrot' > y Rating 'A006875' at 1; 16 threads: ...... 49.89 CPU-seconds. Next recommendation: A006876 (score 0.932) Mu-molecules in Mandelbrot set whose seeds have period n. Topic 'mandelbrot' > show Current sequence is A006876; its score is 0.932 %---- A006876 %I M2883 %S 1,0,1,3,11,20,57,108,240,472,1013,1959,4083,8052,16315,32496, %T 65519,130464,262125,523209,1048353,2095084,4194281,8384100,16777120 %N Mu-molecules in Mandelbrot set whose seeds have period n. %D B. B. Mandelbrot, The Fractal Geometry of Nature, Freeman, NY, 1982, p. 183. %D R. Penrose, The Emperor's New Mind, Penguin Books, NY, 1991, p. 138. %H R. P. Munafo, <a href="http://www.mrob.com/pub/muency.html">Mu-Ency - The Encyclopedia of the Mandelbrot Set</a> %Y Cf. A006874, A006875. %Y Adjacent sequences: A006873 A006874 A006875 this_sequence A006877 A006878 A006879 %Y Sequence in context: A075226 A028978 A082628 this_sequence A031239 A088619 A031318 %K nonn %O 1,4 %A mrob(AT)mrob.com (Robert P Munafo)   Topic 'mandelbrot' > ratings Rated sequences in topic 'mandelbrot': Uprated: A006874 A054670 A052154 A054671 A098403 A006875 Down-rated: A102525 A055544 A090037 Topic 'mandelbrot' > q Saving topic 'mandelbrot'...done

Performance Benchmarks

Every time a sequence is rated, the concordance metric must be performed between it and all other sequences in the database. The processing time needed (measured in "CPU-seconds", number of logical CPUs times elapsed CPU time) is proportional to the size of the entire database times the log (base 2) of the size of the sequence being rated. The number of keyword comparisons per second is roughly the product of the database size and the article size. Using a 2007 version of the OEIS containing 131553 sequences (110 megabytes), a 2.33-GHz Core 2 Duo laptop takes about 20-25 seconds to rate a sequence, and an 8-core, two-socket Xeon "Nehalem" workstation takes less than 4 seconds. The current algorithm is reasonably efficient (see the source) but could still be optimized further.

For significantly larger databases one would need to abandon the concordance metric for something a little less accurate. A number of algorithms have been developed for similar applications, using Markov chains or similar functions. Spam filters typically model character sequences as predictors for subsequent characters; and the Google PageRank system performs a similar function using webpage links as predictors for the relevance of linked pages. In Sloandora, the character-sequence analysis could be used as part of a more efficient concordance metric. The PageRank approach could be applied to existing links in the OEIS database (including when one sequence links explicitly to another, and when two sequences both link to something else like a journal article).

Future Improvements

In addition to performance improvements just discussed, I imagine doing some of the following soon:

It is clear from my results so far that specific types of text (like formulas, sequence numbers, journal titles, email addresses and names of people) should be identified and tagged in different ways. The current implementation uses a generic concordance metric, which is completely "agnostic" to the purpose of the different OEIS data fields. Field-tagging is done in other text concordance applications (for example see [1] e.g. "Candidate Answer Pruning" on page 7).

A lot of this can be done fairly easily, for example, by giving certain fields (like %Y and %E lines) less importance and others (like %S and %T) more. This could be done at the user's recommendation, or by default weights that the system starts with for any newly created topic. The low-level implementation of this is fairly easy within the current concordance search algorithm.

It is relatively easy to "undo" a Relevant or Irrelevant rating, so I'll probably implement that next.

There should be options to display more than one recommendation at a time, and give the user a choice of which one to look at and/or rate. (Only a tiny additional amount of computation would be required.)

When new sequences are added to the OEIS database, the topic needs to be updated. If the updated sequence was not part of the topic definition (i.e. was never given a rating) then all we need to do is update that sequence's score (the sum of its concordance against all Y-rated items minus its concordance against all N-rated items). If the updated sequence is one of the defining sequences for the topic, then the process takes longer: First "undo" the sequence's rating; then "re-rate" the sequence using the new text.

With greater effort, the search algorithm could be improved by the use of a content-addressable index. There is no obvious way to retain the exact metric described above, but the content-addressable index could be used to model the OEIS database as a set of webpages that are linked if they share substrings, and algorithms designed for personalized webpage ranking (e.g. [6]) could provide the auto-recommendation.


Current Source Code

Sloandora runs on UNIX, Linux, or MacOS X systems with Perl and a C compiler.

The C source for the concordance metric is here.

The controlling Perl script is here.

You will also need the eisBTfry00000.txt, eisBTfry00001.txt, etc. files and I can't help you with that, but you can ask the seqfan list (or perhaps email N.J.A. Sloane directly if he already knows you).

A README file explaining how to put the pieces together is here

If you need help setting this up, contact me by email ("email me" link at the bottom of this page).


References

[1] J.H. Kil, L. Lloyd, and S. Skiena, Question Answering with Lydia, TREC-2005, Proceedings of the 2005 Text REtrieval Conference, http://trec.nist.gov/pubs/trec14/t14_proceedings.html

[2] M. Bautin and S. Skiena, Concordance-Based Entity-Oriented Search, Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (2007) 586-592.

[3] Pandora® Internet Radio, (interactive music system), http://www.pandora.com/, 2007-2010.

[4] Rob Walker, The song decoders, New York Times Magazine, 2009 Oct 14. http://www.nytimes.com/2009/10/18/magazine/18Pandora-t.html

[5] The OEIS Foundation, Inc. (http://www.oeisf.org/) has stewardship of The On-Line Encyclopedia of Integer Sequences (OEIS). The legacy (or "Classic") server is at http://oeis.org/classic, and the Wiki is at http://www.oeis.org/wiki.

[6] Christian Borgs et al., A Sublinear Time Algorithm for PageRank Computations, 2012. At arxiv:1202.2771


The On-Line Encyclopedia of Integer Sequences ("OEIS") is intellectual property of The OEIS Foundation Inc.

Pandora is a registered trademark of Pandora Media, Inc.


Robert Munafo's home pages on HostMDS   © 1996-2014 Robert P. Munafo.   about   contact   Google+   mrob27   @mrob_27
This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License. Details here
This page was last updated on 2014 Mar 10.
s.13