How to yield several requests in order in Scrapy?
I need to send my requests in order with Scrapy.

def n1(self, response):
    #self.input = [elem1,elem2,elem3,elem4,elem5, .... ,elem100000]
    for (elem,) in self.input:
        link = urljoin(path, elem)
        yield Request(link)

My problem is that the requests are not sent in order.
I read this question, but it has no correct answer.
How should I change my code to send the requests in order?
UPDATE 1

I used priority and changed my code to:

def n1(self, response):
    #self.input = [elem1,elem2,elem3,elem4,elem5, .... ,elem100000]
    self.prio = len(self.input)
    for (elem,) in self.input:
        self.prio -= 1
        link = urljoin(path, elem)
        yield Request(link, priority=self.prio)

And my settings for this spider are:

custom_settings = {
    'DOWNLOAD_DELAY': 0,
    'COOKIES_ENABLED': True,
    'CONCURRENT_REQUESTS': 1,
    'AUTOTHROTTLE_ENABLED': False,
}

Now the order changes, but it is still not the order of the elements in the array.
scrapy python-requests yield
Try to check priority in Request: doc.scrapy.org/en/latest/topics/request-response.html – Elena Sh. Nov 16 at 11:01
@ElenaSh. Thanks, I tried your suggestion, and updated my question – parik Nov 16 at 11:27
3 Answers
Answer 1 (accepted, answered by Guillaume):
Use a return statement instead of yield. You don't even need to touch any setting:

from scrapy.spiders import Spider, Request

class MySpider(Spider):
    name = 'toscrape.com'
    start_urls = ['http://books.toscrape.com/catalogue/page-1.html']
    # A shared generator: each call to parse() resumes it where it left off.
    urls = (
        'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
    )

    def parse(self, response):
        # return (not yield) exactly one Request per response, so the next page
        # is only scheduled after the current one has been downloaded and parsed.
        for url in self.urls:
            return Request(url)
Output:
2018-11-20 03:35:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-1.html> (referer: http://books.toscrape.com/catalogue/page-1.html)
2018-11-20 03:35:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-2.html> (referer: http://books.toscrape.com/catalogue/page-1.html)
2018-11-20 03:35:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-3.html> (referer: http://books.toscrape.com/catalogue/page-2.html)
2018-11-20 03:35:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-4.html> (referer: http://books.toscrape.com/catalogue/page-3.html)
2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-5.html> (referer: http://books.toscrape.com/catalogue/page-4.html)
2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-6.html> (referer: http://books.toscrape.com/catalogue/page-5.html)
2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-7.html> (referer: http://books.toscrape.com/catalogue/page-6.html)
2018-11-20 03:35:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-8.html> (referer: http://books.toscrape.com/catalogue/page-7.html)
2018-11-20 03:35:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-9.html> (referer: http://books.toscrape.com/catalogue/page-8.html)
2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-10.html> (referer: http://books.toscrape.com/catalogue/page-9.html)
2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-11.html> (referer: http://books.toscrape.com/catalogue/page-10.html)
2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-12.html> (referer: http://books.toscrape.com/catalogue/page-11.html)
2018-11-20 03:35:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-13.html> (referer: http://books.toscrape.com/catalogue/page-12.html)
2018-11-20 03:35:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-14.html> (referer: http://books.toscrape.com/catalogue/page-13.html)
2018-11-20 03:35:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-15.html> (referer: http://books.toscrape.com/catalogue/page-14.html)
2018-11-20 03:35:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-16.html> (referer: http://books.toscrape.com/catalogue/page-15.html)
2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-17.html> (referer: http://books.toscrape.com/catalogue/page-16.html)
2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-18.html> (referer: http://books.toscrape.com/catalogue/page-17.html)
2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-19.html> (referer: http://books.toscrape.com/catalogue/page-18.html)
2018-11-20 03:35:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-20.html> (referer: http://books.toscrape.com/catalogue/page-19.html)
2018-11-20 03:35:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-21.html> (referer: http://books.toscrape.com/catalogue/page-20.html)
2018-11-20 03:35:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-22.html> (referer: http://books.toscrape.com/catalogue/page-21.html)
2018-11-20 03:35:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-23.html> (referer: http://books.toscrape.com/catalogue/page-22.html)
2018-11-20 03:35:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-24.html> (referer: http://books.toscrape.com/catalogue/page-23.html)
2018-11-20 03:35:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-25.html> (referer: http://books.toscrape.com/catalogue/page-24.html)
With a yield statement, the engine gets all the requests from the generator and executes them in an arbitrary order (I suspect they might be stored in some sort of set to remove duplicates).
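Applied to the code in the question, the same pattern could look roughly like this. This is only a sketch: path and the one-element tuples in self.input are taken from the question, while the spider name, the example values, and the pending iterator are placeholders introduced here for illustration.

from urllib.parse import urljoin

from scrapy import Spider, Request


class OrderedSpider(Spider):
    name = 'ordered'
    # Hypothetical stand-ins for the question's path and self.input.
    path = 'http://example.com/'
    input = [('elem1',), ('elem2',), ('elem3',)]

    def start_requests(self):
        # A single iterator shared by every call to parse(), consumed one element at a time.
        self.pending = iter(self.input)
        for (elem,) in self.pending:
            yield Request(urljoin(self.path, elem), callback=self.parse)
            break

    def parse(self, response):
        # ... extract data from the response here ...
        # Return (not yield) the single next request, as in the answer above.
        for (elem,) in self.pending:
            return Request(urljoin(self.path, elem), callback=self.parse)

Each call to parse() pulls exactly one more element from the shared iterator, so element N+1 is only requested after element N's response has arrived.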
Answer 2 (answered by Biswanath):
I think concurrent requests are at play here. You can try setting

custom_settings = {
    'CONCURRENT_REQUESTS': 1
}

The default setting is 8, which would explain why priority is not honoured when there are other workers free to pick up requests.
I have 'CONCURRENT_REQUESTS': 1, 'AUTOTHROTTLE_ENABLED': False in my settings – parik Nov 17 at 17:54
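Since the asker already has that setting, one way to double-check which value is actually in effect at runtime is to log it from the spider itself. A small sketch using only standard Scrapy APIs; the spider name and URL are placeholders:

from scrapy import Spider


class CheckSettingsSpider(Spider):
    name = 'check_settings'
    start_urls = ['http://example.com']  # hypothetical URL
    custom_settings = {'CONCURRENT_REQUESTS': 1}

    def parse(self, response):
        # self.settings holds the merged project settings plus custom_settings,
        # so this logs the concurrency limit the crawler is really using.
        self.logger.info('CONCURRENT_REQUESTS = %s',
                         self.settings.getint('CONCURRENT_REQUESTS'))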
Answer 3 (answered by nicolas):
You can send the next request only after receiving the previous one:

from scrapy import Spider, Request

class MainSpider(Spider):
    urls = [
        'https://www.url1...',
        'https://www.url2...',
        'https://www.url3...',
    ]

    def start_requests(self):
        yield Request(
            url=self.urls[0],
            callback=self.parse,
            meta={'next_index': 1},
        )

    def parse(self, response):
        next_index = response.meta['next_index']
        # do something with response...

        # Process next url
        if next_index < len(self.urls):
            yield Request(
                url=self.urls[next_index],
                callback=self.parse,
                meta={'next_index': next_index + 1},
            )
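The same chaining idea can be adapted to the question's input list and urljoin. This is a rough sketch: the one-element tuples in self.input and the path variable come from the question, while the spider name and example values are hypothetical.

from urllib.parse import urljoin

from scrapy import Spider, Request


class ChainedSpider(Spider):
    name = 'chained'
    # Hypothetical stand-ins for the question's path and self.input.
    path = 'http://example.com/'
    input = [('elem1',), ('elem2',), ('elem3',)]

    def start_requests(self):
        if self.input:
            yield self.request_at(0)

    def request_at(self, index):
        # Build the request for self.input[index], remembering the index in meta.
        (elem,) = self.input[index]
        return Request(urljoin(self.path, elem),
                       callback=self.parse,
                       meta={'index': index})

    def parse(self, response):
        # ... extract data from the response here ...
        next_index = response.meta['index'] + 1
        if next_index < len(self.input):
            # Only schedule the next element once the current response has arrived.
            yield self.request_at(next_index)

One caveat with this approach: if any request in the chain fails and is not retried successfully, the remaining elements are never requested, so for long lists an errback that resumes the chain may be needed.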