Algorithm for finding the most common repeating pattern in a potentially irregular string/array

up vote
0
down vote

favorite

Apologies if this question is a repeat of available questions. Haven't found one that's quite what I am looking for.

I am interested in detecting patterns in strings/arrays such as ABCABCABCABC where this could equally well be encoded by integers. My application is such that I am working with streaming sensors where each letter in the aforementioned sequence would be one sensor (e.g. A is a sensor). Because of sensor failure and whatnot, my sequences are not always quite periodic/repeating. They can come out like this e.g. BCABCABCAB or ABCBCBCA because of various failures.

My application is made harder because a priori I do not know how many sensors there are in my dataset so I require an algorithm to infer that number from the sequence (like those given above). Alas, the algorithm should yield ABC for all the given examples, as this is the longest and most common pattern.

One idea I had was simply to something like:

import numpy as np

from collections import Counter



# ABCABCABCABC encoded with integers 

A = np.array(

  [[ 1 ,2, 3],

   [ 1 ,2, 3],

   [ 1 ,2, 3],

   [ 1 ,2, 3]])



c = Counter(map(tuple, A)).most_common()[0]



# ((1,2,3), 4)

But this seems rather inefficient as I would have to reshape the array multiple times (and a potentially lot of times, since my sequences are very long and, recall, I do not know a priori that the length of my repeating sequence is 3), and then run Counter each time to assess the regularity of an appearing (or not) pattern.

Other ideas including using the Knuth–Morris–Pratt algorithm along with n-grams or some combination thereof. Alternatively calculating the suffix tree.

Is there a better way?

EDIT

More details:

Size of data: sequences of length between 1000 and 1000000 ish (though the upper bound is quite unlikely)

Sub-sequences cannot have repeated entries, they have to be unique. I.e. a sub-sequence cannot be ABB. The reason is quite simple; ultimately I am interested in the temporal evolution of each individual sensor.

edited 21 hours ago

asked 22 hours ago

Astrid

66621530

How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected string ABC, or could it be a multiple concatenation of some permutation such as BAC, or CBA, etc?
– DatHydroGuy
22 hours ago

I do not know the order either; but my assumption is that they are repeating. So the order could well be CBA or BAC but if that is the case, then I expect that to be repeating. And added complication is that at t=0 one of the sensors could be missed so it only logs CB but not a A. This is true for the whole experment, sensors dropping out here and there.
– Astrid
22 hours ago

Can the subsequences have repeated digits/letters?
– Robby Cornelissen
21 hours ago

How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
– Paddy3118
21 hours ago

Ahh fair let me add some more details. Please see updated question.
– Astrid
21 hours ago

|
show 5 more comments

up vote
0
down vote

favorite

Apologies if this question is a repeat of available questions. Haven't found one that's quite what I am looking for.

One idea I had was simply to something like:

import numpy as np

from collections import Counter



# ABCABCABCABC encoded with integers 

A = np.array(

  [[ 1 ,2, 3],

   [ 1 ,2, 3],

   [ 1 ,2, 3],

   [ 1 ,2, 3]])



c = Counter(map(tuple, A)).most_common()[0]



# ((1,2,3), 4)

Other ideas including using the Knuth–Morris–Pratt algorithm along with n-grams or some combination thereof. Alternatively calculating the suffix tree.

Is there a better way?

EDIT

More details:

Size of data: sequences of length between 1000 and 1000000 ish (though the upper bound is quite unlikely)

Sub-sequences cannot have repeated entries, they have to be unique. I.e. a sub-sequence cannot be ABB. The reason is quite simple; ultimately I am interested in the temporal evolution of each individual sensor.

edited 21 hours ago

asked 22 hours ago

Astrid

66621530

How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected string ABC, or could it be a multiple concatenation of some permutation such as BAC, or CBA, etc?
– DatHydroGuy
22 hours ago

I do not know the order either; but my assumption is that they are repeating. So the order could well be CBA or BAC but if that is the case, then I expect that to be repeating. And added complication is that at t=0 one of the sensors could be missed so it only logs CB but not a A. This is true for the whole experment, sensors dropping out here and there.
– Astrid
22 hours ago

Can the subsequences have repeated digits/letters?
– Robby Cornelissen
21 hours ago

How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
– Paddy3118
21 hours ago

Ahh fair let me add some more details. Please see updated question.
– Astrid
21 hours ago

|
show 5 more comments

up vote
0
down vote

favorite

Apologies if this question is a repeat of available questions. Haven't found one that's quite what I am looking for.

One idea I had was simply to something like:

import numpy as np

from collections import Counter



# ABCABCABCABC encoded with integers 

A = np.array(

  [[ 1 ,2, 3],

   [ 1 ,2, 3],

   [ 1 ,2, 3],

   [ 1 ,2, 3]])



c = Counter(map(tuple, A)).most_common()[0]



# ((1,2,3), 4)

Other ideas including using the Knuth–Morris–Pratt algorithm along with n-grams or some combination thereof. Alternatively calculating the suffix tree.

Is there a better way?

EDIT

More details:

Size of data: sequences of length between 1000 and 1000000 ish (though the upper bound is quite unlikely)

Sub-sequences cannot have repeated entries, they have to be unique. I.e. a sub-sequence cannot be ABB. The reason is quite simple; ultimately I am interested in the temporal evolution of each individual sensor.

edited 21 hours ago

asked 22 hours ago

Astrid

66621530

Apologies if this question is a repeat of available questions. Haven't found one that's quite what I am looking for.

One idea I had was simply to something like:

import numpy as np

from collections import Counter



# ABCABCABCABC encoded with integers 

A = np.array(

  [[ 1 ,2, 3],

   [ 1 ,2, 3],

   [ 1 ,2, 3],

   [ 1 ,2, 3]])



c = Counter(map(tuple, A)).most_common()[0]



# ((1,2,3), 4)

Other ideas including using the Knuth–Morris–Pratt algorithm along with n-grams or some combination thereof. Alternatively calculating the suffix tree.

Is there a better way?

EDIT

More details:

Size of data: sequences of length between 1000 and 1000000 ish (though the upper bound is quite unlikely)

Sub-sequences cannot have repeated entries, they have to be unique. I.e. a sub-sequence cannot be ABB. The reason is quite simple; ultimately I am interested in the temporal evolution of each individual sensor.

python algorithm count

edited 21 hours ago

asked 22 hours ago

Astrid

66621530

edited 21 hours ago

asked 22 hours ago

Astrid

66621530

edited 21 hours ago

asked 22 hours ago

Astrid

66621530

asked 22 hours ago

Astrid

66621530

asked 22 hours ago

Astrid

66621530

How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected string ABC, or could it be a multiple concatenation of some permutation such as BAC, or CBA, etc?
– DatHydroGuy
22 hours ago

I do not know the order either; but my assumption is that they are repeating. So the order could well be CBA or BAC but if that is the case, then I expect that to be repeating. And added complication is that at t=0 one of the sensors could be missed so it only logs CB but not a A. This is true for the whole experment, sensors dropping out here and there.
– Astrid
22 hours ago

Can the subsequences have repeated digits/letters?
– Robby Cornelissen
21 hours ago

How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
– Paddy3118
21 hours ago

Ahh fair let me add some more details. Please see updated question.
– Astrid
21 hours ago

|
show 5 more comments

How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected string ABC, or could it be a multiple concatenation of some permutation such as BAC, or CBA, etc?
– DatHydroGuy
22 hours ago

I do not know the order either; but my assumption is that they are repeating. So the order could well be CBA or BAC but if that is the case, then I expect that to be repeating. And added complication is that at t=0 one of the sensors could be missed so it only logs CB but not a A. This is true for the whole experment, sensors dropping out here and there.
– Astrid
22 hours ago

Can the subsequences have repeated digits/letters?
– Robby Cornelissen
21 hours ago

How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
– Paddy3118
21 hours ago

Ahh fair let me add some more details. Please see updated question.
– Astrid
21 hours ago

How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected string ABC, or could it be a multiple concatenation of some permutation such as BAC, or CBA, etc?
– DatHydroGuy
22 hours ago

I do not know the order either; but my assumption is that they are repeating. So the order could well be CBA or BAC but if that is the case, then I expect that to be repeating. And added complication is that at t=0 one of the sensors could be missed so it only logs CB but not a A. This is true for the whole experment, sensors dropping out here and there.
– Astrid
22 hours ago

Can the subsequences have repeated digits/letters?
– Robby Cornelissen
21 hours ago

How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
– Paddy3118
21 hours ago

Ahh fair let me add some more details. Please see updated question.
– Astrid
21 hours ago

|
show 5 more comments

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53371043%2falgorithm-for-finding-the-most-common-repeating-pattern-in-a-potentially-irregul%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

active

oldest

votes

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Ftyjtk