Web Crawler
Where does the data come from? A database or a message queue can both serve as data sources; here we use a web crawler!
Crawling data: fetch the page returned by the request, then filter out the data we want.
We use the jsoup package to parse the web page.
Import the dependency
```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>
```
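As a quick check that the dependency is wired up, here is a minimal sketch (not part of the original project; the class name `JsoupSmokeTest` is made up for illustration) that parses a static HTML snippet with jsoup and reads out its title and a paragraph:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupSmokeTest {
    public static void main(String[] args) {
        // Parse an in-memory HTML string; no network access needed
        String html = "<html><head><title>hello jsoup</title></head>"
                + "<body><p class=\"msg\">parsed!</p></body></html>";
        Document doc = Jsoup.parse(html);

        System.out.println(doc.title());                 // hello jsoup
        System.out.println(doc.select("p.msg").text());  // parsed!
    }
}
```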
Crawler utility class
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class HtmlParseUtil {

    public static void main(String[] args) {
        try {
            new HtmlParseUtil().parseJD("java").forEach(System.out::println);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public List<Content> parseJD(String keywords) throws IOException {
        // Build the JD search URL and fetch the page with a 30-second timeout
        String url = "https://search.jd.com/Search?keyword=" + keywords;
        Document document = Jsoup.parse(new URL(url), 30000);

        // The search results are inside the element with id "J_goodsList"
        Element element = document.getElementById("J_goodsList");
        Elements li = element.getElementsByTag("li");

        ArrayList<Content> goodsList = new ArrayList<>();
        for (Element el : li) {
            // JD lazy-loads product images, so the URL is read from this attribute instead of src
            String img = el.getElementsByTag("img").eq(0).attr("source-data-lazy-img");
            String price = el.getElementsByClass("p-price").eq(0).text();
            String title = el.getElementsByClass("p-name").eq(0).text();

            Content content = new Content();
            content.setTitle(title);
            content.setImg(img);
            content.setPrice(price);
            goodsList.add(content);
        }
        return goodsList;
    }
}
```
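The utility class above depends on a Content POJO that is not shown in this section. A minimal sketch, assuming only the three fields the crawler actually sets (title, img, price); the real class may carry extra fields, constructors, or Lombok annotations:

```java
// Minimal sketch of the Content POJO referenced by HtmlParseUtil.
// The field set is an assumption based on the setters used above.
public class Content {
    private String title;
    private String img;
    private String price;

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getImg() { return img; }
    public void setImg(String img) { this.img = img; }

    public String getPrice() { return price; }
    public void setPrice(String price) { this.price = price; }

    @Override
    public String toString() {
        return "Content{title='" + title + "', img='" + img + "', price='" + price + "'}";
    }
}
```

The `toString()` override is there so that the `forEach(System.out::println)` call in `main` prints readable output instead of object hash codes.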