مساله:
تابعی بنویسید که یک رشته حاوی متن را بگیرد (احتمالا با علائم نگارشی و line-breaks) و یک آرایه از 3 واژه که بیشترین تکرار را داشته است را به صورت نزولی نسبت به تکرار برگرداند.
فرضیات:
- یک کلمه شامل حروف (A-Z) از که می تواند یک یا بیشتر (‘) داشته باشد. (نیازی به رسیدگی به علائم نیست)
- مطابقت نباید به حروف بزرگ و کوچک حساس باشد، و کلمه در نتیجه باید به صورت تمام کوچک (lower case) نمایش داده شود.
- روابط ممکن است خودسرانه از بین برود.
- اگر یک متن شامل کمتر از سه کلمه منحصر به فرد باشد ، باید کلمات top-2 یا top-1 برگردانده شوند ، یا اگر متنی حاوی کلمات مشابه نیست ، یک آرایه خالی بازگردانده می شود.
مثال ها:
top_3_words("In a village of La Mancha, the name of which I have no desire to call to mind, there lived not long since one of those gentlemen that keep a lance in the lance-rack, an old buckler, a lean hack, and a greyhound for coursing. An olla of rather more beef than mutton, a salad on most nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra on Sundays, made away with three-quarters of his income.") # => ["a", "of", "on"] top_3_words("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e") # => ["e", "ddd", "aa"] top_3_words(" //wont won't won't") # => ["won't", "wont"]
Write a function that, given a string of text (possibly with punctuation and line-breaks), returns an array of the top-3 most occurring words, in descending order of the number of occurrences.
Assumptions:
-A word is a string of letters (A to Z) optionally containing one or more apostrophes (‘) in ASCII. (No need to handle fancy punctuation.)
-Matches should be case-insensitive, and the words in the result should be lowercased.
-Ties may be broken arbitrarily.
-If a text contains fewer than three unique words, then either the top-2 or top-1 words should be returned, or an empty array if a text contains no words.
Examples:
top_3_words("In a village of La Mancha, the name of which I have no desire to call to mind, there lived not long since one of those gentlemen that keep a lance in the lance-rack, an old buckler, a lean hack, and a greyhound for coursing. An olla of rather more beef than mutton, a salad on most nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra on Sundays, made away with three-quarters of his income.") # => ["a", "of", "on"] top_3_words("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e") # => ["e", "ddd", "aa"] top_3_words(" //wont won't won't") # => ["won't", "wont"]
For java users, the calls will actually be in the form: TopWords.top3(String s)
, expecting you to return a List<String>
.
Bonus points (not really, but just for fun):
1-Avoid creating an array whose memory footprint is roughly as big as the input text.
2-Avoid sorting the entire array of unique words.
import java.util.Comparator; import java.util.HashMap; import java.util.List; import java.util.Map; import java.util.regex.Matcher; import java.util.regex.Pattern; import java.util.stream.Collectors; public class TopWords { public static List<String> top3(String s) { final Pattern pattern = Pattern.compile("[A-Za-z][A-Za-z']*"); final Matcher matcher = pattern.matcher(s.toLowerCase()); final Map<String, Integer> wordsCount = new HashMap<>(); while (matcher.find()) { String str = matcher.group(); wordsCount.put(str, wordsCount.getOrDefault(str, 1) + 1); } return wordsCount.entrySet().stream() .sorted(Comparator.comparing(Map.Entry::getValue, Comparator.reverseOrder())).limit(3) .map(Map.Entry::getKey).collect(Collectors.toList()); } }
import java.util.List; import java.util.Arrays; import java.util.Map; import java.util.function.Function; import java.util.stream.Collectors; import java.util.Comparator; import java.util.Collections; import static java.util.stream.Collectors.groupingBy; public class TopWords { public static List<String> top3(String s) { return Arrays.stream(s.toLowerCase().split("[^a-z*|']")) .filter(o -> !o.isEmpty() && !o.replace("'","").isEmpty()) .collect(groupingBy(Function.identity(), Collectors.counting())).entrySet().stream() .sorted(Collections.reverseOrder(Map.Entry.comparingByValue())) .map(Map.Entry::getKey) .limit(3) .collect(Collectors.toList()); } }
import java.util.*; import java.util.regex.*; import static java.util.stream.Collectors.toList; public class TopWords { final static private Pattern P = Pattern.compile("[a-z][a-z']*"); public static List<String> top3(String s) { Map<String,Integer> cnt = new HashMap<>(); Matcher m = P.matcher(s.toLowerCase()); while (m.find()) { cnt.put(m.group(), cnt.getOrDefault(m.group(), 0)+1); } return cnt.entrySet().stream().sorted(TopWords::compare) .limit(3).map(Map.Entry::getKey).collect(toList()); } private static int compare(Map.Entry<String,Integer> a, Map.Entry<String,Integer> b) { int u = a.getValue(), v = b.getValue(); return -Integer.compare(u,v); } }
import java.util.ArrayList; import java.util.Arrays; import java.util.List; import java.util.Locale; import java.util.Objects; import java.util.function.Function; import java.util.regex.Pattern; import java.util.stream.Collectors; public class TopWords { private static final Pattern PATTERN = Pattern.compile("[\\s|[^\\w|']]|_"); private static final Pattern PATTERN_WORD = Pattern.compile("[\\w+'?\\w*]+"); public static List<String> top3(final String s) { final List<String> result = new ArrayList<>(3); Arrays.stream(PATTERN.split(s)) .filter(Objects::nonNull) .filter(word -> !word.isEmpty() && !word.isBlank()) .filter(word -> PATTERN_WORD.matcher(word).matches()) .filter(word -> !word.replaceAll("'", "").isEmpty()) .collect(Collectors.groupingBy(Function.identity(), Collectors.counting())) .entrySet() .stream() .sorted((a, b) -> b.getValue().equals(a.getValue()) ? a.getKey().compareToIgnoreCase(b.getKey()) : Long.compare(b.getValue(), a.getValue())) .forEach(entry -> { if (result.size() < 3) { result.add(entry.getKey().toLowerCase(Locale.ENGLISH)); } }); return result; } }