Algorithm for finding the most common repeating pattern in a potentially irregular string/array











up vote
0
down vote

favorite












Apologies if this question is a repeat of available questions. Haven't found one that's quite what I am looking for.



I am interested in detecting patterns in strings/arrays such as ABCABCABCABC where this could equally well be encoded by integers. My application is such that I am working with streaming sensors where each letter in the aforementioned sequence would be one sensor (e.g. A is a sensor). Because of sensor failure and whatnot, my sequences are not always quite periodic/repeating. They can come out like this e.g. BCABCABCAB or ABCBCBCA because of various failures.



My application is made harder because a priori I do not know how many sensors there are in my dataset so I require an algorithm to infer that number from the sequence (like those given above). Alas, the algorithm should yield ABC for all the given examples, as this is the longest and most common pattern.



One idea I had was simply to something like:



import numpy as np
from collections import Counter

# ABCABCABCABC encoded with integers
A = np.array(
[[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3]])

c = Counter(map(tuple, A)).most_common()[0]

# ((1,2,3), 4)


But this seems rather inefficient as I would have to reshape the array multiple times (and a potentially lot of times, since my sequences are very long and, recall, I do not know a priori that the length of my repeating sequence is 3), and then run Counter each time to assess the regularity of an appearing (or not) pattern.



Other ideas including using the Knuth–Morris–Pratt algorithm along with n-grams or some combination thereof. Alternatively calculating the suffix tree.



Is there a better way?



EDIT



More details:




  • Size of data: sequences of length between 1000 and 1000000 ish (though the upper bound is quite unlikely)

  • Sub-sequences cannot have repeated entries, they have to be unique. I.e. a sub-sequence cannot be ABB. The reason is quite simple; ultimately I am interested in the temporal evolution of each individual sensor.










share|improve this question
























  • How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected string ABC, or could it be a multiple concatenation of some permutation such as BAC, or CBA, etc?
    – DatHydroGuy
    22 hours ago










  • I do not know the order either; but my assumption is that they are repeating. So the order could well be CBA or BAC but if that is the case, then I expect that to be repeating. And added complication is that at t=0 one of the sensors could be missed so it only logs CB but not a A. This is true for the whole experment, sensors dropping out here and there.
    – Astrid
    22 hours ago












  • Can the subsequences have repeated digits/letters?
    – Robby Cornelissen
    21 hours ago










  • How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
    – Paddy3118
    21 hours ago












  • Ahh fair let me add some more details. Please see updated question.
    – Astrid
    21 hours ago

















up vote
0
down vote

favorite












Apologies if this question is a repeat of available questions. Haven't found one that's quite what I am looking for.



I am interested in detecting patterns in strings/arrays such as ABCABCABCABC where this could equally well be encoded by integers. My application is such that I am working with streaming sensors where each letter in the aforementioned sequence would be one sensor (e.g. A is a sensor). Because of sensor failure and whatnot, my sequences are not always quite periodic/repeating. They can come out like this e.g. BCABCABCAB or ABCBCBCA because of various failures.



My application is made harder because a priori I do not know how many sensors there are in my dataset so I require an algorithm to infer that number from the sequence (like those given above). Alas, the algorithm should yield ABC for all the given examples, as this is the longest and most common pattern.



One idea I had was simply to something like:



import numpy as np
from collections import Counter

# ABCABCABCABC encoded with integers
A = np.array(
[[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3]])

c = Counter(map(tuple, A)).most_common()[0]

# ((1,2,3), 4)


But this seems rather inefficient as I would have to reshape the array multiple times (and a potentially lot of times, since my sequences are very long and, recall, I do not know a priori that the length of my repeating sequence is 3), and then run Counter each time to assess the regularity of an appearing (or not) pattern.



Other ideas including using the Knuth–Morris–Pratt algorithm along with n-grams or some combination thereof. Alternatively calculating the suffix tree.



Is there a better way?



EDIT



More details:




  • Size of data: sequences of length between 1000 and 1000000 ish (though the upper bound is quite unlikely)

  • Sub-sequences cannot have repeated entries, they have to be unique. I.e. a sub-sequence cannot be ABB. The reason is quite simple; ultimately I am interested in the temporal evolution of each individual sensor.










share|improve this question
























  • How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected string ABC, or could it be a multiple concatenation of some permutation such as BAC, or CBA, etc?
    – DatHydroGuy
    22 hours ago










  • I do not know the order either; but my assumption is that they are repeating. So the order could well be CBA or BAC but if that is the case, then I expect that to be repeating. And added complication is that at t=0 one of the sensors could be missed so it only logs CB but not a A. This is true for the whole experment, sensors dropping out here and there.
    – Astrid
    22 hours ago












  • Can the subsequences have repeated digits/letters?
    – Robby Cornelissen
    21 hours ago










  • How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
    – Paddy3118
    21 hours ago












  • Ahh fair let me add some more details. Please see updated question.
    – Astrid
    21 hours ago















up vote
0
down vote

favorite









up vote
0
down vote

favorite











Apologies if this question is a repeat of available questions. Haven't found one that's quite what I am looking for.



I am interested in detecting patterns in strings/arrays such as ABCABCABCABC where this could equally well be encoded by integers. My application is such that I am working with streaming sensors where each letter in the aforementioned sequence would be one sensor (e.g. A is a sensor). Because of sensor failure and whatnot, my sequences are not always quite periodic/repeating. They can come out like this e.g. BCABCABCAB or ABCBCBCA because of various failures.



My application is made harder because a priori I do not know how many sensors there are in my dataset so I require an algorithm to infer that number from the sequence (like those given above). Alas, the algorithm should yield ABC for all the given examples, as this is the longest and most common pattern.



One idea I had was simply to something like:



import numpy as np
from collections import Counter

# ABCABCABCABC encoded with integers
A = np.array(
[[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3]])

c = Counter(map(tuple, A)).most_common()[0]

# ((1,2,3), 4)


But this seems rather inefficient as I would have to reshape the array multiple times (and a potentially lot of times, since my sequences are very long and, recall, I do not know a priori that the length of my repeating sequence is 3), and then run Counter each time to assess the regularity of an appearing (or not) pattern.



Other ideas including using the Knuth–Morris–Pratt algorithm along with n-grams or some combination thereof. Alternatively calculating the suffix tree.



Is there a better way?



EDIT



More details:




  • Size of data: sequences of length between 1000 and 1000000 ish (though the upper bound is quite unlikely)

  • Sub-sequences cannot have repeated entries, they have to be unique. I.e. a sub-sequence cannot be ABB. The reason is quite simple; ultimately I am interested in the temporal evolution of each individual sensor.










share|improve this question















Apologies if this question is a repeat of available questions. Haven't found one that's quite what I am looking for.



I am interested in detecting patterns in strings/arrays such as ABCABCABCABC where this could equally well be encoded by integers. My application is such that I am working with streaming sensors where each letter in the aforementioned sequence would be one sensor (e.g. A is a sensor). Because of sensor failure and whatnot, my sequences are not always quite periodic/repeating. They can come out like this e.g. BCABCABCAB or ABCBCBCA because of various failures.



My application is made harder because a priori I do not know how many sensors there are in my dataset so I require an algorithm to infer that number from the sequence (like those given above). Alas, the algorithm should yield ABC for all the given examples, as this is the longest and most common pattern.



One idea I had was simply to something like:



import numpy as np
from collections import Counter

# ABCABCABCABC encoded with integers
A = np.array(
[[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3]])

c = Counter(map(tuple, A)).most_common()[0]

# ((1,2,3), 4)


But this seems rather inefficient as I would have to reshape the array multiple times (and a potentially lot of times, since my sequences are very long and, recall, I do not know a priori that the length of my repeating sequence is 3), and then run Counter each time to assess the regularity of an appearing (or not) pattern.



Other ideas including using the Knuth–Morris–Pratt algorithm along with n-grams or some combination thereof. Alternatively calculating the suffix tree.



Is there a better way?



EDIT



More details:




  • Size of data: sequences of length between 1000 and 1000000 ish (though the upper bound is quite unlikely)

  • Sub-sequences cannot have repeated entries, they have to be unique. I.e. a sub-sequence cannot be ABB. The reason is quite simple; ultimately I am interested in the temporal evolution of each individual sensor.







python algorithm count






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited 21 hours ago

























asked 22 hours ago









Astrid

66621530




66621530












  • How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected string ABC, or could it be a multiple concatenation of some permutation such as BAC, or CBA, etc?
    – DatHydroGuy
    22 hours ago










  • I do not know the order either; but my assumption is that they are repeating. So the order could well be CBA or BAC but if that is the case, then I expect that to be repeating. And added complication is that at t=0 one of the sensors could be missed so it only logs CB but not a A. This is true for the whole experment, sensors dropping out here and there.
    – Astrid
    22 hours ago












  • Can the subsequences have repeated digits/letters?
    – Robby Cornelissen
    21 hours ago










  • How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
    – Paddy3118
    21 hours ago












  • Ahh fair let me add some more details. Please see updated question.
    – Astrid
    21 hours ago




















  • How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected string ABC, or could it be a multiple concatenation of some permutation such as BAC, or CBA, etc?
    – DatHydroGuy
    22 hours ago










  • I do not know the order either; but my assumption is that they are repeating. So the order could well be CBA or BAC but if that is the case, then I expect that to be repeating. And added complication is that at t=0 one of the sensors could be missed so it only logs CB but not a A. This is true for the whole experment, sensors dropping out here and there.
    – Astrid
    22 hours ago












  • Can the subsequences have repeated digits/letters?
    – Robby Cornelissen
    21 hours ago










  • How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
    – Paddy3118
    21 hours ago












  • Ahh fair let me add some more details. Please see updated question.
    – Astrid
    21 hours ago


















How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected string ABC, or could it be a multiple concatenation of some permutation such as BAC, or CBA, etc?
– DatHydroGuy
22 hours ago




How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected string ABC, or could it be a multiple concatenation of some permutation such as BAC, or CBA, etc?
– DatHydroGuy
22 hours ago












I do not know the order either; but my assumption is that they are repeating. So the order could well be CBA or BAC but if that is the case, then I expect that to be repeating. And added complication is that at t=0 one of the sensors could be missed so it only logs CB but not a A. This is true for the whole experment, sensors dropping out here and there.
– Astrid
22 hours ago






I do not know the order either; but my assumption is that they are repeating. So the order could well be CBA or BAC but if that is the case, then I expect that to be repeating. And added complication is that at t=0 one of the sensors could be missed so it only logs CB but not a A. This is true for the whole experment, sensors dropping out here and there.
– Astrid
22 hours ago














Can the subsequences have repeated digits/letters?
– Robby Cornelissen
21 hours ago




Can the subsequences have repeated digits/letters?
– Robby Cornelissen
21 hours ago












How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
– Paddy3118
21 hours ago






How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
– Paddy3118
21 hours ago














Ahh fair let me add some more details. Please see updated question.
– Astrid
21 hours ago






Ahh fair let me add some more details. Please see updated question.
– Astrid
21 hours ago



















active

oldest

votes











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














 

draft saved


draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53371043%2falgorithm-for-finding-the-most-common-repeating-pattern-in-a-potentially-irregul%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown






























active

oldest

votes













active

oldest

votes









active

oldest

votes






active

oldest

votes
















 

draft saved


draft discarded



















































 


draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53371043%2falgorithm-for-finding-the-most-common-repeating-pattern-in-a-potentially-irregul%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Volksrepublik China

How to test boost logger output in unit testing?

Write to the output between two pipeline