Most frequently used words in a text

Most frequently used words in a text

مساله:

تابعی بنویسید که یک رشته حاوی متن را بگیرد (احتمالا با علائم نگارشی و line-breaks) و یک آرایه از 3 واژه که بیشترین تکرار را داشته است را به صورت نزولی نسبت به تکرار برگرداند.

فرضیات:

  • یک کلمه شامل حروف (A-Z) از که می تواند یک یا بیشتر (‘) داشته باشد. (نیازی به رسیدگی به علائم نیست)
  • مطابقت نباید به حروف بزرگ و کوچک حساس باشد، و کلمه در نتیجه باید به صورت تمام کوچک (lower case) نمایش داده شود.
  • روابط ممکن است خودسرانه از بین برود.
  • اگر یک متن شامل کمتر از سه کلمه منحصر به فرد باشد ، باید کلمات top-2 یا top-1 برگردانده شوند ، یا اگر متنی حاوی کلمات مشابه نیست ، یک آرایه خالی بازگردانده می شود.

مثال ها:

top_3_words("In a village of La Mancha, the name of which I have no desire to call to
mind, there lived not long since one of those gentlemen that keep a lance
in the lance-rack, an old buckler, a lean hack, and a greyhound for
coursing. An olla of rather more beef than mutton, a salad on most
nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
on Sundays, made away with three-quarters of his income.")
# => ["a", "of", "on"]

top_3_words("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e")
# => ["e", "ddd", "aa"]

top_3_words("  //wont won't won't")
# => ["won't", "wont"]

Write a function that, given a string of text (possibly with punctuation and line-breaks), returns an array of the top-3 most occurring words, in descending order of the number of occurrences.

Assumptions:

-A word is a string of letters (A to Z) optionally containing one or more apostrophes (‘) in ASCII. (No need to handle fancy punctuation.)

-Matches should be case-insensitive, and the words in the result should be lowercased.

-Ties may be broken arbitrarily.

-If a text contains fewer than three unique words, then either the top-2 or top-1 words should be returned, or an empty array if a text contains no words.

Examples:

top_3_words("In a village of La Mancha, the name of which I have no desire to call to
mind, there lived not long since one of those gentlemen that keep a lance
in the lance-rack, an old buckler, a lean hack, and a greyhound for
coursing. An olla of rather more beef than mutton, a salad on most
nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
on Sundays, made away with three-quarters of his income.")
# => ["a", "of", "on"]

top_3_words("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e")
# => ["e", "ddd", "aa"]

top_3_words("  //wont won't won't")
# => ["won't", "wont"]

For java users, the calls will actually be in the form: TopWords.top3(String s), expecting you to return a List<String>.

Bonus points (not really, but just for fun):

1-Avoid creating an array whose memory footprint is roughly as big as the input text.

2-Avoid sorting the entire array of unique words.


import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TopWords {
  public static List<String> top3(String s) {
    final Pattern pattern = Pattern.compile("[A-Za-z][A-Za-z']*");
    final Matcher matcher = pattern.matcher(s.toLowerCase());
    final Map<String, Integer> wordsCount = new HashMap<>();

    while (matcher.find()) {
      String str = matcher.group();
      wordsCount.put(str, wordsCount.getOrDefault(str, 1) + 1);
    }

    return wordsCount.entrySet().stream()
        .sorted(Comparator.comparing(Map.Entry::getValue, Comparator.reverseOrder())).limit(3)
        .map(Map.Entry::getKey).collect(Collectors.toList());
  }
}
import java.util.List;
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.Comparator;
import java.util.Collections;

import static java.util.stream.Collectors.groupingBy;

public class TopWords {
    
    public static List<String> top3(String s) {
        return Arrays.stream(s.toLowerCase().split("[^a-z*|']"))
                .filter(o -> !o.isEmpty() && !o.replace("'","").isEmpty())
                .collect(groupingBy(Function.identity(), Collectors.counting())).entrySet().stream()
                .sorted(Collections.reverseOrder(Map.Entry.comparingByValue()))
                .map(Map.Entry::getKey)
                .limit(3)
                .collect(Collectors.toList());
    }
}
import java.util.*;
import java.util.regex.*;
import static java.util.stream.Collectors.toList;


public class TopWords {

    final static private Pattern P = Pattern.compile("[a-z][a-z']*");
    
    public static List<String> top3(String s) {
        
        Map<String,Integer> cnt = new HashMap<>();
        Matcher m = P.matcher(s.toLowerCase());
        while (m.find()) {
            cnt.put(m.group(), cnt.getOrDefault(m.group(), 0)+1);
        }
        return cnt.entrySet().stream().sorted(TopWords::compare)
                  .limit(3).map(Map.Entry::getKey).collect(toList());
    }
    
    private static int compare(Map.Entry<String,Integer> a, Map.Entry<String,Integer> b) {
        int u = a.getValue(), v = b.getValue();
        return -Integer.compare(u,v);
    }
}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Objects;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TopWords {
    private static final Pattern PATTERN = Pattern.compile("[\\s|[^\\w|']]|_");
    private static final Pattern PATTERN_WORD = Pattern.compile("[\\w+'?\\w*]+");

    public static List<String> top3(final String s) {
        final List<String> result = new ArrayList<>(3);
        Arrays.stream(PATTERN.split(s))
                .filter(Objects::nonNull)
                .filter(word -> !word.isEmpty() && !word.isBlank())
                .filter(word -> PATTERN_WORD.matcher(word).matches())
                .filter(word -> !word.replaceAll("'", "").isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet() 
                .stream()
                .sorted((a, b) -> b.getValue().equals(a.getValue()) ? a.getKey().compareToIgnoreCase(b.getKey()) : Long.compare(b.getValue(), a.getValue()))
                .forEach(entry -> {
                    if (result.size() < 3) {
                        result.add(entry.getKey().toLowerCase(Locale.ENGLISH));
                    }
                });

        return result;
    }
}

دیدگاهتان را بنویسید