Algorithm for finding the most common repeating pattern in a potentially irregular string/array
up vote
0
down vote
favorite
Apologies if this question is a repeat of available questions. Haven't found one that's quite what I am looking for.
I am interested in detecting patterns in strings/arrays such as ABCABCABCABC
where this could equally well be encoded by integers. My application is such that I am working with streaming sensors where each letter in the aforementioned sequence would be one sensor (e.g. A
is a sensor). Because of sensor failure and whatnot, my sequences are not always quite periodic/repeating. They can come out like this e.g. BCABCABCAB
or ABCBCBCA
because of various failures.
My application is made harder because a priori I do not know how many sensors there are in my dataset so I require an algorithm to infer that number from the sequence (like those given above). Alas, the algorithm should yield ABC
for all the given examples, as this is the longest and most common pattern.
One idea I had was simply to something like:
import numpy as np
from collections import Counter
# ABCABCABCABC encoded with integers
A = np.array(
[[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3]])
c = Counter(map(tuple, A)).most_common()[0]
# ((1,2,3), 4)
But this seems rather inefficient as I would have to reshape the array multiple times (and a potentially lot of times, since my sequences are very long and, recall, I do not know a priori that the length of my repeating sequence is 3), and then run Counter
each time to assess the regularity of an appearing (or not) pattern.
Other ideas including using the Knuth–Morris–Pratt algorithm along with n-grams or some combination thereof. Alternatively calculating the suffix tree.
Is there a better way?
EDIT
More details:
- Size of data: sequences of length between 1000 and 1000000 ish (though the upper bound is quite unlikely)
- Sub-sequences cannot have repeated entries, they have to be unique. I.e. a sub-sequence cannot be
ABB
. The reason is quite simple; ultimately I am interested in the temporal evolution of each individual sensor.
python algorithm count
|
show 5 more comments
up vote
0
down vote
favorite
Apologies if this question is a repeat of available questions. Haven't found one that's quite what I am looking for.
I am interested in detecting patterns in strings/arrays such as ABCABCABCABC
where this could equally well be encoded by integers. My application is such that I am working with streaming sensors where each letter in the aforementioned sequence would be one sensor (e.g. A
is a sensor). Because of sensor failure and whatnot, my sequences are not always quite periodic/repeating. They can come out like this e.g. BCABCABCAB
or ABCBCBCA
because of various failures.
My application is made harder because a priori I do not know how many sensors there are in my dataset so I require an algorithm to infer that number from the sequence (like those given above). Alas, the algorithm should yield ABC
for all the given examples, as this is the longest and most common pattern.
One idea I had was simply to something like:
import numpy as np
from collections import Counter
# ABCABCABCABC encoded with integers
A = np.array(
[[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3]])
c = Counter(map(tuple, A)).most_common()[0]
# ((1,2,3), 4)
But this seems rather inefficient as I would have to reshape the array multiple times (and a potentially lot of times, since my sequences are very long and, recall, I do not know a priori that the length of my repeating sequence is 3), and then run Counter
each time to assess the regularity of an appearing (or not) pattern.
Other ideas including using the Knuth–Morris–Pratt algorithm along with n-grams or some combination thereof. Alternatively calculating the suffix tree.
Is there a better way?
EDIT
More details:
- Size of data: sequences of length between 1000 and 1000000 ish (though the upper bound is quite unlikely)
- Sub-sequences cannot have repeated entries, they have to be unique. I.e. a sub-sequence cannot be
ABB
. The reason is quite simple; ultimately I am interested in the temporal evolution of each individual sensor.
python algorithm count
How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected stringABC
, or could it be a multiple concatenation of some permutation such asBAC
, orCBA
, etc?
– DatHydroGuy
22 hours ago
I do not know the order either; but my assumption is that they are repeating. So the order could well beCBA
orBAC
but if that is the case, then I expect that to be repeating. And added complication is that att=0
one of the sensors could be missed so it only logsCB
but not aA
. This is true for the whole experment, sensors dropping out here and there.
– Astrid
22 hours ago
Can the subsequences have repeated digits/letters?
– Robby Cornelissen
21 hours ago
How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
– Paddy3118
21 hours ago
Ahh fair let me add some more details. Please see updated question.
– Astrid
21 hours ago
|
show 5 more comments
up vote
0
down vote
favorite
up vote
0
down vote
favorite
Apologies if this question is a repeat of available questions. Haven't found one that's quite what I am looking for.
I am interested in detecting patterns in strings/arrays such as ABCABCABCABC
where this could equally well be encoded by integers. My application is such that I am working with streaming sensors where each letter in the aforementioned sequence would be one sensor (e.g. A
is a sensor). Because of sensor failure and whatnot, my sequences are not always quite periodic/repeating. They can come out like this e.g. BCABCABCAB
or ABCBCBCA
because of various failures.
My application is made harder because a priori I do not know how many sensors there are in my dataset so I require an algorithm to infer that number from the sequence (like those given above). Alas, the algorithm should yield ABC
for all the given examples, as this is the longest and most common pattern.
One idea I had was simply to something like:
import numpy as np
from collections import Counter
# ABCABCABCABC encoded with integers
A = np.array(
[[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3]])
c = Counter(map(tuple, A)).most_common()[0]
# ((1,2,3), 4)
But this seems rather inefficient as I would have to reshape the array multiple times (and a potentially lot of times, since my sequences are very long and, recall, I do not know a priori that the length of my repeating sequence is 3), and then run Counter
each time to assess the regularity of an appearing (or not) pattern.
Other ideas including using the Knuth–Morris–Pratt algorithm along with n-grams or some combination thereof. Alternatively calculating the suffix tree.
Is there a better way?
EDIT
More details:
- Size of data: sequences of length between 1000 and 1000000 ish (though the upper bound is quite unlikely)
- Sub-sequences cannot have repeated entries, they have to be unique. I.e. a sub-sequence cannot be
ABB
. The reason is quite simple; ultimately I am interested in the temporal evolution of each individual sensor.
python algorithm count
Apologies if this question is a repeat of available questions. Haven't found one that's quite what I am looking for.
I am interested in detecting patterns in strings/arrays such as ABCABCABCABC
where this could equally well be encoded by integers. My application is such that I am working with streaming sensors where each letter in the aforementioned sequence would be one sensor (e.g. A
is a sensor). Because of sensor failure and whatnot, my sequences are not always quite periodic/repeating. They can come out like this e.g. BCABCABCAB
or ABCBCBCA
because of various failures.
My application is made harder because a priori I do not know how many sensors there are in my dataset so I require an algorithm to infer that number from the sequence (like those given above). Alas, the algorithm should yield ABC
for all the given examples, as this is the longest and most common pattern.
One idea I had was simply to something like:
import numpy as np
from collections import Counter
# ABCABCABCABC encoded with integers
A = np.array(
[[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3]])
c = Counter(map(tuple, A)).most_common()[0]
# ((1,2,3), 4)
But this seems rather inefficient as I would have to reshape the array multiple times (and a potentially lot of times, since my sequences are very long and, recall, I do not know a priori that the length of my repeating sequence is 3), and then run Counter
each time to assess the regularity of an appearing (or not) pattern.
Other ideas including using the Knuth–Morris–Pratt algorithm along with n-grams or some combination thereof. Alternatively calculating the suffix tree.
Is there a better way?
EDIT
More details:
- Size of data: sequences of length between 1000 and 1000000 ish (though the upper bound is quite unlikely)
- Sub-sequences cannot have repeated entries, they have to be unique. I.e. a sub-sequence cannot be
ABB
. The reason is quite simple; ultimately I am interested in the temporal evolution of each individual sensor.
python algorithm count
python algorithm count
edited 21 hours ago
asked 22 hours ago
Astrid
66621530
66621530
How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected stringABC
, or could it be a multiple concatenation of some permutation such asBAC
, orCBA
, etc?
– DatHydroGuy
22 hours ago
I do not know the order either; but my assumption is that they are repeating. So the order could well beCBA
orBAC
but if that is the case, then I expect that to be repeating. And added complication is that att=0
one of the sensors could be missed so it only logsCB
but not aA
. This is true for the whole experment, sensors dropping out here and there.
– Astrid
22 hours ago
Can the subsequences have repeated digits/letters?
– Robby Cornelissen
21 hours ago
How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
– Paddy3118
21 hours ago
Ahh fair let me add some more details. Please see updated question.
– Astrid
21 hours ago
|
show 5 more comments
How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected stringABC
, or could it be a multiple concatenation of some permutation such asBAC
, orCBA
, etc?
– DatHydroGuy
22 hours ago
I do not know the order either; but my assumption is that they are repeating. So the order could well beCBA
orBAC
but if that is the case, then I expect that to be repeating. And added complication is that att=0
one of the sensors could be missed so it only logsCB
but not aA
. This is true for the whole experment, sensors dropping out here and there.
– Astrid
22 hours ago
Can the subsequences have repeated digits/letters?
– Robby Cornelissen
21 hours ago
How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
– Paddy3118
21 hours ago
Ahh fair let me add some more details. Please see updated question.
– Astrid
21 hours ago
How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected string
ABC
, or could it be a multiple concatenation of some permutation such as BAC
, or CBA
, etc?– DatHydroGuy
22 hours ago
How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected string
ABC
, or could it be a multiple concatenation of some permutation such as BAC
, or CBA
, etc?– DatHydroGuy
22 hours ago
I do not know the order either; but my assumption is that they are repeating. So the order could well be
CBA
or BAC
but if that is the case, then I expect that to be repeating. And added complication is that at t=0
one of the sensors could be missed so it only logs CB
but not a A
. This is true for the whole experment, sensors dropping out here and there.– Astrid
22 hours ago
I do not know the order either; but my assumption is that they are repeating. So the order could well be
CBA
or BAC
but if that is the case, then I expect that to be repeating. And added complication is that at t=0
one of the sensors could be missed so it only logs CB
but not a A
. This is true for the whole experment, sensors dropping out here and there.– Astrid
22 hours ago
Can the subsequences have repeated digits/letters?
– Robby Cornelissen
21 hours ago
Can the subsequences have repeated digits/letters?
– Robby Cornelissen
21 hours ago
How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
– Paddy3118
21 hours ago
How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
– Paddy3118
21 hours ago
Ahh fair let me add some more details. Please see updated question.
– Astrid
21 hours ago
Ahh fair let me add some more details. Please see updated question.
– Astrid
21 hours ago
|
show 5 more comments
active
oldest
votes
active
oldest
votes
active
oldest
votes
active
oldest
votes
active
oldest
votes
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53371043%2falgorithm-for-finding-the-most-common-repeating-pattern-in-a-potentially-irregul%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
How do you know which order the sensors will record their data in? Is there some guarantee that the longest sequence will be some multiple concatenation of the expected string
ABC
, or could it be a multiple concatenation of some permutation such asBAC
, orCBA
, etc?– DatHydroGuy
22 hours ago
I do not know the order either; but my assumption is that they are repeating. So the order could well be
CBA
orBAC
but if that is the case, then I expect that to be repeating. And added complication is that att=0
one of the sensors could be missed so it only logsCB
but not aA
. This is true for the whole experment, sensors dropping out here and there.– Astrid
22 hours ago
Can the subsequences have repeated digits/letters?
– Robby Cornelissen
21 hours ago
How big is the data, 100 items? 100,000? How many sensors? What is the largest possible repeat? It could be that regular expressions on a string might help if the size of the data is amenable.
– Paddy3118
21 hours ago
Ahh fair let me add some more details. Please see updated question.
– Astrid
21 hours ago