How to yield several requests in order in Scrapy?























I need to send my requests in order with Scrapy.



    def n1(self, response):
        # self.input = [elem1, elem2, elem3, elem4, elem5, ..., elem100000]
        for (elem,) in self.input:
            link = urljoin(path, elem)
            yield Request(link)


My problem is that the requests are not sent in order.
I read this question, but it has no correct answer.

How should I change my code to send the requests in order?



UPDATE 1



I used priority and changed my code to



    def n1(self, response):
        # self.input = [elem1, elem2, elem3, elem4, elem5, ..., elem100000]
        self.prio = len(self.input)
        for (elem,) in self.input:
            self.prio -= 1
            link = urljoin(path, elem)
            yield Request(link, priority=self.prio)


And my setting for this spider is



    custom_settings = {
        'DOWNLOAD_DELAY': 0,
        'COOKIES_ENABLED': True,
        'CONCURRENT_REQUESTS': 1,
        'AUTOTHROTTLE_ENABLED': False,
    }


Now the order has changed, but it still does not match the order of the elements in the array.


































  • Try to check priority in Request: doc.scrapy.org/en/latest/topics/request-response.html
    – Elena Sh.
    Nov 16 at 11:01










  • @ElenaSh. Thanks, I tried your suggestion, and updated my question
    – parik
    Nov 16 at 11:27















scrapy python-requests yield














edited yesterday

























asked Nov 16 at 10:50









parik

























3 Answers




















accepted










Use a return statement instead of yield.



You don't even need to touch any setting:



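    from scrapy import Request, Spider


    class MySpider(Spider):
        name = 'toscrape.com'
        start_urls = ['http://books.toscrape.com/catalogue/page-1.html']

        # A generator expression: it is shared by every parse() call and
        # hands out each URL exactly once.
        urls = (
            'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
        )

        def parse(self, response):
            # Return only the next URL; the loop body runs at most once per call.
            for url in self.urls:
                return Request(url)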


Output:



2018-11-20 03:35:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-1.html> (referer: http://books.toscrape.com/catalogue/page-1.html)
2018-11-20 03:35:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-2.html> (referer: http://books.toscrape.com/catalogue/page-1.html)
2018-11-20 03:35:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-3.html> (referer: http://books.toscrape.com/catalogue/page-2.html)
2018-11-20 03:35:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-4.html> (referer: http://books.toscrape.com/catalogue/page-3.html)
2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-5.html> (referer: http://books.toscrape.com/catalogue/page-4.html)
2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-6.html> (referer: http://books.toscrape.com/catalogue/page-5.html)
2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-7.html> (referer: http://books.toscrape.com/catalogue/page-6.html)
2018-11-20 03:35:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-8.html> (referer: http://books.toscrape.com/catalogue/page-7.html)
2018-11-20 03:35:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-9.html> (referer: http://books.toscrape.com/catalogue/page-8.html)
2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-10.html> (referer: http://books.toscrape.com/catalogue/page-9.html)
2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-11.html> (referer: http://books.toscrape.com/catalogue/page-10.html)
2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-12.html> (referer: http://books.toscrape.com/catalogue/page-11.html)
2018-11-20 03:35:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-13.html> (referer: http://books.toscrape.com/catalogue/page-12.html)
2018-11-20 03:35:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-14.html> (referer: http://books.toscrape.com/catalogue/page-13.html)
2018-11-20 03:35:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-15.html> (referer: http://books.toscrape.com/catalogue/page-14.html)
2018-11-20 03:35:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-16.html> (referer: http://books.toscrape.com/catalogue/page-15.html)
2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-17.html> (referer: http://books.toscrape.com/catalogue/page-16.html)
2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-18.html> (referer: http://books.toscrape.com/catalogue/page-17.html)
2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-19.html> (referer: http://books.toscrape.com/catalogue/page-18.html)
2018-11-20 03:35:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-20.html> (referer: http://books.toscrape.com/catalogue/page-19.html)
2018-11-20 03:35:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-21.html> (referer: http://books.toscrape.com/catalogue/page-20.html)
2018-11-20 03:35:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-22.html> (referer: http://books.toscrape.com/catalogue/page-21.html)
2018-11-20 03:35:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-23.html> (referer: http://books.toscrape.com/catalogue/page-22.html)
2018-11-20 03:35:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-24.html> (referer: http://books.toscrape.com/catalogue/page-23.html)
2018-11-20 03:35:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-25.html> (referer: http://books.toscrape.com/catalogue/page-24.html)


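With a yield statement, the engine receives all the requests from the generator and schedules them, so they are not necessarily downloaded in the order they were yielded (I suspect they might be stored in some sort of set to remove duplicates). Returning a single request per call means the next page is requested only after the previous response has arrived.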
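One documented reason for out-of-order crawling is that Scrapy's pending-request queues are LIFO by default, so requests that are all queued up front tend to leave the queue newest first. The sketch below is a toy model of that effect in plain Python, not Scrapy's real scheduler: `crawl_order` and `URLS` are hypothetical names, and the real engine also consumes callback output lazily and downloads concurrently, so actual runs will not be an exact reversal. `FIFO_SETTINGS` lists the settings the Scrapy FAQ gives for breadth-first (FIFO) crawling.

```python
from collections import deque


def crawl_order(urls, lifo=True):
    """Return the order in which one worker would take URLs queued up front.

    Toy model only: lifo=True mimics a last-in-first-out pending queue,
    lifo=False mimics a first-in-first-out one.
    """
    pending = deque(urls)  # every request is queued before any is taken
    order = []
    while pending:
        order.append(pending.pop() if lifo else pending.popleft())
    return order


URLS = ["page-1", "page-2", "page-3"]  # hypothetical example input

# Settings from the Scrapy FAQ for breadth-first (FIFO) crawling.
# They could be placed in a spider's custom_settings.
FIFO_SETTINGS = {
    "DEPTH_PRIORITY": 1,
    "SCHEDULER_DISK_QUEUE": "scrapy.squeues.PickleFifoDiskQueue",
    "SCHEDULER_MEMORY_QUEUE": "scrapy.squeues.FifoMemoryQueue",
}

if __name__ == "__main__":
    print("LIFO:", crawl_order(URLS, lifo=True))
    print("FIFO:", crawl_order(URLS, lifo=False))
```

Even with FIFO queues, responses can still arrive out of order when more than one request is in flight, so these settings are probably best combined with `CONCURRENT_REQUESTS = 1` if strict ordering matters.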



















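    I think concurrent requests are at play here. You can try setting

        custom_settings = {
            'CONCURRENT_REQUESTS': 1,
        }

    The default for CONCURRENT_REQUESTS is 16 (with CONCURRENT_REQUESTS_PER_DOMAIN defaulting to 8). That would partly explain why the priority is not honoured: while other download slots are free, lower-priority requests are sent without waiting for the earlier ones to finish.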
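To see why concurrency alone can scramble the order, here is a small simulation using only the standard library. It is not Scrapy code: `fetch`, `crawl`, `completion_order`, the page names and the delays are all made up for illustration, and `asyncio.sleep` stands in for network latency.

```python
import asyncio

# Hypothetical jobs: (name, simulated download time in seconds).
JOBS = [("page-1", 0.06), ("page-2", 0.02), ("page-3", 0.04)]


async def fetch(name, delay, slots, finished):
    """Wait for a free download slot, 'download', then record completion."""
    async with slots:
        await asyncio.sleep(delay)  # stand-in for the network round trip
        finished.append(name)


async def crawl(jobs, concurrency):
    """Run all jobs with at most `concurrency` downloads in flight."""
    slots = asyncio.Semaphore(concurrency)
    finished = []
    await asyncio.gather(*(fetch(n, d, slots, finished) for n, d in jobs))
    return finished


def completion_order(concurrency):
    """Return the order in which the simulated downloads finish."""
    return asyncio.run(crawl(JOBS, concurrency))


if __name__ == "__main__":
    print("1 slot: ", completion_order(1))
    print("3 slots:", completion_order(3))
```

With a single slot the jobs should finish in the order they were submitted; with three slots the fastest download should finish first, whatever order the requests were created in.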



























    • I have 'CONCURRENT_REQUESTS' : 1 ,'AUTOTHROTTLE_ENABLED' : False, in my settings
      – parik
      Nov 17 at 17:54































    You can send the next request only after receiving the response to the previous one:

        from scrapy import Request, Spider


        class MainSpider(Spider):
            name = 'main'  # placeholder; Scrapy requires every spider to have a name
            urls = [
                'https://www.url1...',
                'https://www.url2...',
                'https://www.url3...',
            ]

            def start_requests(self):
                # Only the first URL is requested up front.
                yield Request(
                    url=self.urls[0],
                    callback=self.parse,
                    meta={'next_index': 1},
                )

            def parse(self, response):
                next_index = response.meta['next_index']

                # do something with response...

                # Process next url
                if next_index < len(self.urls):
                    yield Request(
                        url=self.urls[next_index],
                        callback=self.parse,
                        meta={'next_index': next_index + 1},
                    )
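The chaining idea can be checked without Scrapy. The sketch below is a toy engine in plain Python, with requests and responses modelled as dictionaries; `parse`, `run` and `URLS` are hypothetical names for illustration. It deliberately pops pending requests in LIFO order to show that the order is still preserved, because only one request is ever pending at a time.

```python
from collections import deque


def parse(response, urls):
    """Callback: after one response, ask only for the next URL in the list."""
    next_index = response["meta"]["next_index"]
    if next_index < len(urls):
        yield {"url": urls[next_index], "meta": {"next_index": next_index + 1}}


def run(urls):
    """Tiny stand-in for an engine: download one request, then run its callback."""
    pending = deque([{"url": urls[0], "meta": {"next_index": 1}}])
    crawled = []
    while pending:
        request = pending.pop()  # LIFO on purpose; the queue never holds more than one item
        crawled.append(request["url"])
        response = request  # fake download: the response carries the request's meta
        pending.extend(parse(response, urls))
    return crawled


URLS = ["url1", "url2", "url3", "url4"]  # hypothetical example input

if __name__ == "__main__":
    print(run(URLS))
```

The trade-off is throughput: since each request waits for the previous response, nothing is downloaded in parallel, which may matter for a list of 100000 elements.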





          2018-11-20 03:35:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-14.html> (referer: http://books.toscrape.com/catalogue/page-13.html)
          2018-11-20 03:35:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-15.html> (referer: http://books.toscrape.com/catalogue/page-14.html)
          2018-11-20 03:35:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-16.html> (referer: http://books.toscrape.com/catalogue/page-15.html)
          2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-17.html> (referer: http://books.toscrape.com/catalogue/page-16.html)
          2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-18.html> (referer: http://books.toscrape.com/catalogue/page-17.html)
          2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-19.html> (referer: http://books.toscrape.com/catalogue/page-18.html)
          2018-11-20 03:35:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-20.html> (referer: http://books.toscrape.com/catalogue/page-19.html)
          2018-11-20 03:35:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-21.html> (referer: http://books.toscrape.com/catalogue/page-20.html)
          2018-11-20 03:35:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-22.html> (referer: http://books.toscrape.com/catalogue/page-21.html)
          2018-11-20 03:35:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-23.html> (referer: http://books.toscrape.com/catalogue/page-22.html)
          2018-11-20 03:35:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-24.html> (referer: http://books.toscrape.com/catalogue/page-23.html)
          2018-11-20 03:35:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-25.html> (referer: http://books.toscrape.com/catalogue/page-24.html)


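          With a yield statement, the engine pulls all the requests out of the generator and schedules them at once, so they are not necessarily downloaded in the order they were yielded (I suspect the scheduler's queue, not the generator, then decides the order; as far as I know the default in-memory queue is last-in, first-out).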
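To make the difference concrete, here is a toy simulation in plain Python. It is only a sketch and does not use Scrapy: it assumes the scheduler behaves like a simple last-in, first-out stack with one request in flight, and it ignores priorities, duplicate filtering and concurrency. All the names in it (`simulate_crawl`, `make_yield_all`, `make_return_one`) are made up for the illustration.

```python
def simulate_crawl(parse, start_url):
    """Toy scheduler: a plain LIFO stack, one request in flight at a time."""
    stack = [start_url]
    crawled = []
    while stack:
        url = stack.pop()           # last scheduled, first downloaded
        crawled.append(url)
        stack.extend(parse(url))    # schedule whatever the callback produced
    return crawled


def make_yield_all(urls, start_url):
    """Callback that emits every URL at once, like a for loop with yield."""
    def parse(url):
        if url == start_url:
            yield from urls
    return parse


def make_return_one(urls):
    """Callback that emits only the next URL, like return inside the loop."""
    remaining = iter(urls)

    def parse(url):
        next_url = next(remaining, None)
        return [] if next_url is None else [next_url]
    return parse


START = 'start'
PAGES = ['page-1', 'page-2', 'page-3']

print(simulate_crawl(make_yield_all(PAGES, START), START))
# ['start', 'page-3', 'page-2', 'page-1']
print(simulate_crawl(make_return_one(PAGES), START))
# ['start', 'page-1', 'page-2', 'page-3']
```

In this model, emitting everything at once leaves the order to the queue, while emitting one request per response leaves nothing for the queue to reorder.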

















          answered 12 hours ago









          Guillaume






































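              I think concurrent requests are at play here. You can try setting:

              custom_settings = {
                  'CONCURRENT_REQUESTS': 1,
              }

              The default for CONCURRENT_REQUESTS is 16 (8 is the default for CONCURRENT_REQUESTS_PER_DOMAIN). That would explain why priority is not strictly honoured: while several download slots are free, a lower-priority request can be sent before a higher-priority one has finished.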
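If limiting concurrency alone is not enough, the scheduler queues can also be switched from last-in, first-out to first-in, first-out. The snippet below is a sketch based on what I remember from the Scrapy FAQ entry on breadth-first crawling; I have not tested it against the spider in the question, so check the setting names against the documentation for your Scrapy version. Retries and redirects can still disturb the order.

```python
# Sketch of spider-level settings for downloading one request at a time,
# in the order the requests were scheduled (assumption: the queue class
# paths match the ones listed in the Scrapy FAQ for your version).
custom_settings = {
    'CONCURRENT_REQUESTS': 1,
    'DEPTH_PRIORITY': 1,
    'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
    'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
}
```

The dictionary would go on the spider class as `custom_settings`, just like the one above.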






              answered Nov 17 at 7:18









              Biswanath













              • I have 'CONCURRENT_REQUESTS': 1 and 'AUTOTHROTTLE_ENABLED': False in my settings.
                – parik
                Nov 17 at 17:54































              You can send the next request only after receiving the response to the previous one:

              from scrapy import Request, Spider


              class MainSpider(Spider):
                  name = 'main'  # placeholder; Scrapy requires every spider to have a name
                  urls = [
                      'https://www.url1...',
                      'https://www.url2...',
                      'https://www.url3...',
                  ]

                  def start_requests(self):
                      # Only the first URL is requested up front.
                      yield Request(
                          url=self.urls[0],
                          callback=self.parse,
                          meta={'next_index': 1},
                      )

                  def parse(self, response):
                      next_index = response.meta['next_index']

                      # do something with response...

                      # Request the next URL, carrying the position along in meta
                      if next_index < len(self.urls):
                          yield Request(
                              url=self.urls[next_index],
                              callback=self.parse,
                              meta={'next_index': next_index + 1},
                          )





                  answered yesterday









                  nicolas































                       
